Abstract: We formulate the notion of minimax estimation under storage or communication
constraints, and prove an extension to Pinsker's theorem for nonparametric estimation over
Sobolev ellipsoids. Placing limits on the number of bits used to encode any estimator, we
give tight lower and upper bounds on the excess risk due to quantization in terms of the number
of bits, the signal size, and the noise level. This establishes the Pareto optimal tradeoff
between storage and risk under quantization constraints for Sobolev spaces. Our results and
proof techniques combine elements of rate distortion theory and minimax analysis. The proposed
quantized estimation scheme, which shows achievability of the lower bounds, is adaptive
in the usual statistical sense, achieving the optimal quantized minimax rate without knowledge
of the smoothness parameter of the Sobolev space. It is also adaptive in a computational sense,
as it constructs the code only after observing the data, to dynamically allocate more codewords
to blocks where the estimated signal size is large. Simulations are included that illustrate the
effect of quantization on statistical risk.

Keywords: nonparametric estimation, minimax bounds, rate distortion theory, constrained
estimation, Sobolev ellipsoid


1. Introduction 

In this paper we introduce a minimax framework for nonparametric estimation under storage
constraints. In the classical statistical setting, the minimax risk for estimating a function $f$ from a
function class $\mathcal{F}$ using a sample of size $n$ places no constraints on the estimator $\hat f_n$, other than
requiring it to be a measurable function of the data. However, if the estimator is to be constructed
with restrictions on the computational resources used, it is of interest to understand how the error
can degrade. Letting $C(\hat f_n) \le B_n$ indicate that the computational resources $C(\hat f_n)$ used to
construct $\hat f_n$ are required to fall within a budget $B_n$, the constrained minimax risk is

$$R_n(\mathcal{F}, B_n) = \inf_{\hat f_n :\, C(\hat f_n) \le B_n} \; \sup_{f \in \mathcal{F}} R(\hat f_n, f).$$

Minimax lower bounds on the risk as a function of the computational budget thus determine a
feasible region for computation constrained estimation, and a Pareto optimal tradeoff for risk versus
computation as $B_n$ varies.

Several recent papers have presented results on tradeoffs between statistical risk and computational
resources, measured in terms of either running time of the algorithm, number of floating
point operations, or number of bits used to store or construct the estimators [5, 6, 16]. However, the
existing work quantifies the tradeoff by analyzing the statistical and computational performance of
specific procedures, rather than by establishing lower bounds and a Pareto optimal tradeoff. In this
paper we treat the case where the complexity $C(\hat f_n)$ is measured by the storage or space used by
the procedure and sharply characterize the optimal tradeoff. Specifically, we limit the number of
bits used to represent the estimator $\hat f_n$. We focus on the setting of nonparametric regression under
standard smoothness assumptions, and study how the excess risk depends on the storage budget $B_n$.

We view the study of quantized estimation as a theoretical problem of fundamental interest.
But quantization may arise naturally in future applications of large scale statistical estimation. For
instance, when data are collected and analyzed on board a remote satellite, the estimated values
may need to be sent back to Earth for further analysis. To limit communication costs, the estimates
can be quantized, and it becomes important to understand what, in principle, is lost in terms of
statistical risk through quantization. A related scenario is a cloud computing environment where data
are processed for many different statistical estimation problems, with the estimates then stored for
future analysis. To limit the storage costs, which could dominate the compute costs in many
scenarios, it is of interest to quantize the estimates, and the quantization-risk tradeoff again becomes
an important concern. Estimates are always quantized to some degree in practice. But to impose
energy constraints on computation, future processors may limit precision in arithmetic computations
more significantly [11]; the cost of limited precision in terms of statistical risk must then be
quantified. A related problem is to distribute the estimation over many parallel processors, and to
then limit the communication costs of the submodels to the central host. We focus on the centralized
setting in the current paper, but an extension to the distributed case may be possible with the
techniques that we introduce here.

We study risk-storage tradeoffs in the normal means model of nonparametric estimation assuming
the target function lies in a Sobolev space. The problem is intimately related to classical rate
distortion theory [12], and our results rely on a marriage of minimax theory and rate distortion
ideas. We thus build on and refine the connection between function estimation and lossy source
coding that was elucidated in David Donoho's 1998 Wald Lectures [9].

We work in the Gaussian white noise model

$$dX(t) = f(t)\,dt + \varepsilon\,dW(t), \qquad 0 \le t \le 1, \tag{1.1}$$

where $W$ is a standard Wiener process on $[0,1]$, $\varepsilon$ is the standard deviation of the noise, and $f$
lies in the periodic Sobolev space $\widetilde{W}(m,c)$ of order $m$ and radius $c$. (We discuss the nonperiodic
Sobolev space $W(m,c)$ in Section 4.) The white noise model is a centerpiece of nonparametric
estimation. It is asymptotically equivalent to nonparametric regression [4] and density estimation
[17], and simplifies some of the mathematical analysis in our framework. In this classical setting,
the minimax risk of estimation

$$R_\varepsilon(m,c) = \inf_{\hat f_\varepsilon} \; \sup_{f \in \widetilde{W}(m,c)} \mathbb{E}\|f - \hat f_\varepsilon\|_2^2$$

is well known to satisfy

$$\lim_{\varepsilon \to 0} \varepsilon^{-\frac{4m}{2m+1}} R_\varepsilon(m,c) = \left(\frac{c^2(2m+1)}{\pi^{2m}}\right)^{\frac{1}{2m+1}} \left(\frac{m}{m+1}\right)^{\frac{2m}{2m+1}} \triangleq P_{m,c} \tag{1.2}$$

where $P_{m,c}$ is Pinsker's constant [18]. The constrained minimax risk for quantized estimation
becomes

$$R_\varepsilon(m,c,B_\varepsilon) = \inf_{\hat f_\varepsilon,\, C(\hat f_\varepsilon) \le B_\varepsilon} \; \sup_{f \in \widetilde{W}(m,c)} \mathbb{E}\|f - \hat f_\varepsilon\|_2^2$$

where $\hat f_\varepsilon$ is a quantized estimator that is required to use storage $C(\hat f_\varepsilon)$ no greater than $B_\varepsilon$ bits in
total. Our main result identifies three separate quantization regimes.

• In the over-sufficient regime, the number of bits is very large, satisfying $B_\varepsilon \gg \varepsilon^{-\frac{2}{2m+1}}$, and
the classical minimax rate of convergence $R_\varepsilon \asymp \varepsilon^{\frac{4m}{2m+1}}$ is obtained. Moreover, the optimal
constant is the Pinsker constant $P_{m,c}$.

• In the sufficient regime, the number of bits scales as $B_\varepsilon \asymp \varepsilon^{-\frac{2}{2m+1}}$. This level of quantization
is just sufficient to preserve the classical minimax rate of convergence, and thus in this regime
$R_\varepsilon(m,c,B_\varepsilon) \asymp \varepsilon^{\frac{4m}{2m+1}}$. However, the optimal constant degrades to a new constant $P_{m,c} + Q_{m,c,d}$,
where $Q_{m,c,d}$ is characterized in terms of the solution of a certain variational problem,
depending on $d = \lim_{\varepsilon \to 0} B_\varepsilon \varepsilon^{\frac{2}{2m+1}}$.

• In the insufficient regime, the number of bits scales as $B_\varepsilon \ll \varepsilon^{-\frac{2}{2m+1}}$, with however $B_\varepsilon \to \infty$.
Under this scaling the number of bits is insufficient to preserve the unquantized minimax rate
of convergence, and the quantization error dominates the estimation error. We show that the
quantized minimax risk in this case satisfies

$$\lim_{\varepsilon \to 0} B_\varepsilon^{2m} R_\varepsilon(m,c,B_\varepsilon) = \frac{c^2 m^{2m}}{\pi^{2m}}.$$

Thus, in the insufficient regime the quantized minimax rate of convergence is $B_\varepsilon^{-2m}$, with
optimal constant as shown above.

By using an upper bound for the family of constants $Q_{m,c,d}$, the three regimes can be combined
together to view the risk in terms of a decomposition into estimation error and quantization error.
Specifically, we can write

$$R_\varepsilon(m,c,B_\varepsilon) \approx \underbrace{P_{m,c}\,\varepsilon^{\frac{4m}{2m+1}}}_{\text{estimation error}} + \underbrace{\frac{c^2 m^{2m}}{\pi^{2m}} B_\varepsilon^{-2m}}_{\text{quantization error}}.$$

When $B_\varepsilon \gg \varepsilon^{-\frac{2}{2m+1}}$, the estimation error dominates the quantization error, and the usual minimax
rate and constant are obtained. In the insufficient case $B_\varepsilon \ll \varepsilon^{-\frac{2}{2m+1}}$, only a slower rate of
convergence is achievable. When $B_\varepsilon$ and $\varepsilon^{-\frac{2}{2m+1}}$ are comparable, the estimation error and quantization
error are on the same order. The threshold $\varepsilon^{-\frac{2}{2m+1}}$ should not be surprising, given that in classical
unquantized estimation the minimax rate of convergence is achieved by estimating the first
$\varepsilon^{-\frac{2}{2m+1}}$ Fourier coefficients and simply setting the remaining coefficients to zero. This corresponds
to selecting a smoothing bandwidth that scales as $h \asymp n^{-\frac{1}{2m+1}}$ with the sample size $n$.
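
To make the decomposition concrete, the following minimal Python sketch evaluates Pinsker's constant from (1.2) and the two terms of the approximate risk. The function names and the finite cutoff `ratio` used to label a regime are illustrative assumptions; the regimes themselves are asymptotic statements and have no sharp finite-sample boundary.

```python
import math


def pinsker_constant(m: float, c: float) -> float:
    """Pinsker's constant P_{m,c} as given in equation (1.2)."""
    return ((c ** 2 * (2 * m + 1) / math.pi ** (2 * m)) ** (1 / (2 * m + 1))
            * (m / (m + 1)) ** (2 * m / (2 * m + 1)))


def risk_decomposition(m: float, c: float, eps: float, bits: float):
    """Return (estimation error, quantization error) of the approximate risk."""
    estimation = pinsker_constant(m, c) * eps ** (4 * m / (2 * m + 1))
    quantization = c ** 2 * m ** (2 * m) / math.pi ** (2 * m) * bits ** (-2 * m)
    return estimation, quantization


def regime(m: float, eps: float, bits: float, ratio: float = 10.0) -> str:
    """Heuristic regime label; `ratio` is an arbitrary illustrative cutoff."""
    threshold = eps ** (-2 / (2 * m + 1))  # number of coefficients worth estimating
    if bits >= ratio * threshold:
        return "over-sufficient"
    if bits <= threshold / ratio:
        return "insufficient"
    return "sufficient"


if __name__ == "__main__":
    est, quant = risk_decomposition(m=2, c=1.0, eps=0.01, bits=50)
    print(regime(2, 0.01, 50), est, quant)
```

Comparing the two returned terms for a range of budgets shows which error source is expected to dominate for a given noise level.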

At a high level, our proof strategy integrates elements of minimax theory and source coding 
theory. In minimax analysis one computes lower bounds by thinking in Bayesian terms to look 
for least-favorable priors. In source coding analysis one constructs worst case distributions by 
setting up an optimization problem based on mutual information. Our quantized minimax analysis 
requires that these approaches be carefully combined to balance the estimation and quantization 
errors. To show achievability of the lower bounds we establish, we likewise need to construct an 
estimator and coding scheme together. Our approach is to quantize the blockwise James-Stein 
estimator, which achieves the classical Pinsker bound. However, our quantization scheme differs 
from the approach taken in classical rate distortion theory, where the generation of the codebook is 
determined once the source distribution is known. In our setting, we require the allocation of bits to 
be adaptive to the data, using more bits for blocks that have larger signal size. We therefore design 
a quantized estimation procedure that adaptively distributes the communication budget across the 
blocks. Assuming only a lower bound $m_0$ on the smoothness $m$ and an upper bound $c_0$ on the
radius $c$ of the Sobolev space, our quantization-estimation procedure is adaptive to $m$ and $c$ in the
usual statistical sense, and is also adaptive to the coding regime. In other words, given a storage
budget $B_\varepsilon$, the coding procedure achieves the optimal rate and constant for the unknown $m$ and $c$,
operating in the corresponding regime for those parameters.

In the following section we establish some notation, outline our proof strategy, and present some 
simple examples. In Section 3 we state and prove our main result on quantized minimax lower 
bounds, relegating some of the technical details to an appendix. In Section 4 we show asymptotic 
achievability of these lower bounds, using a quantized estimation procedure based on adaptive 
James-Stein estimation and quantization in blocks, again deferring proofs of technical lemmas to 
the supplementary material. This is followed by a presentation of some results from experiments 
in Section 5, illustrating the performance and properties of the proposed quantized estimation 
procedure. 

2. Quantized estimation and minimax risk

Suppose that $(X_1, \ldots, X_n) \in \mathcal{X}^n$ is a random vector drawn from a distribution $P_n$. Consider the
problem of estimating a functional $\theta_n = \theta(P_n)$ of the distribution, assuming $\theta_n$ is restricted to
lie in a parameter space $\Theta_n$. To unclutter some of the notation, we will suppress the subscript $n$
and write $\theta$ and $\Theta$ in the following, keeping in mind that nonparametric settings are allowed. The
subscript $n$ will be maintained for random variables. The minimax $\ell_2$ risk of estimating $\theta$ is then
defined as

$$R_n(\Theta) = \inf_{\hat\theta_n} \sup_{\theta \in \Theta} \mathbb{E}_\theta \|\theta - \hat\theta_n\|^2$$

where the infimum is taken over all possible estimators $\hat\theta_n : \mathcal{X}^n \to \Theta$ that are measurable with
respect to the data $X_1, \ldots, X_n$. We will abuse notation by using $\hat\theta_n$ to denote both the estimator and
the estimate calculated based on an observed set of data. Among numerous approaches to obtaining
the minimax risk, the Bayesian method is best aligned with quantized estimation. Consider a prior
distribution $\pi(\theta)$ whose support is a subset of $\Theta$. Let $\delta(X_{1:n})$ be the posterior mean of $\theta$ given the
data $X_1, \ldots, X_n$, which minimizes the integrated risk. Then for any estimator $\hat\theta_n$,

$$\sup_{\theta \in \Theta} \mathbb{E}_\theta \|\theta - \hat\theta_n\|^2 \ge \int_\Theta \mathbb{E}_\theta \|\theta - \hat\theta_n\|^2 \, d\pi(\theta) \ge \int_\Theta \mathbb{E}_\theta \|\theta - \delta(X_{1:n})\|^2 \, d\pi(\theta).$$

Taking the infimum over $\hat\theta_n$ yields

$$\inf_{\hat\theta_n} \sup_{\theta \in \Theta} \mathbb{E}_\theta \|\theta - \hat\theta_n\|^2 \ge \int_\Theta \mathbb{E}_\theta \|\theta - \delta(X_{1:n})\|^2 \, d\pi(\theta) \triangleq R_n(\Theta; \pi).$$

Thus, any prior distribution supported on $\Theta$ gives a lower bound on the minimax risk, and selecting
the least-favorable prior leads to the largest lower bound provable by this approach.

Now consider constraints on the storage or communication cost of our estimate. We restrict to
the set of estimators that use no more than a total of $B_n$ bits; that is, the estimator takes at most $2^{B_n}$
different values. Such quantized estimators can be formulated by the following two-step procedure.
First, an encoder maps the data $X_{1:n}$ to an index $\phi_n(X_{1:n})$, where

$$\phi_n : \mathcal{X}^n \to \{1, 2, \ldots, 2^{B_n}\}$$

is the encoding function. The decoder, after receiving or retrieving the index, represents the
estimates based on a decoding function

$$\psi_n : \{1, 2, \ldots, 2^{B_n}\} \to \Theta,$$

mapping the index to a codebook of estimates. All that needs to be transmitted or stored is the
$B_n$-bit-long index, and the quantized estimator $\hat\theta_n$ is simply $\psi_n \circ \phi_n$, the composition of the encoder
and the decoder functions. Denoting by $C(\hat\theta_n)$ the storage, in terms of the number of bits, required
by an estimator $\hat\theta_n$, the minimax risk of quantized estimation is then defined as

$$R_n(\Theta, B_n) = \inf_{\hat\theta_n,\, C(\hat\theta_n) \le B_n} \; \sup_{\theta \in \Theta} \mathbb{E}_\theta \|\theta - \hat\theta_n\|^2,$$
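and we are interested in the effect of the constraint on the minimax risk. Once again, we consider
a prior distribution $\pi(\theta)$ supported on $\Theta$ and let $\delta(X_{1:n})$ be the posterior mean of $\theta$ given the data.
The integrated risk can then be decomposed as

$$\begin{aligned}
\int_\Theta \mathbb{E}_\theta \|\theta - \hat\theta_n\|^2 \, d\pi(\theta) &= \mathbb{E}\|\theta - \delta(X_{1:n}) + \delta(X_{1:n}) - \hat\theta_n\|^2 \\
&= \mathbb{E}\|\theta - \delta(X_{1:n})\|^2 + \mathbb{E}\|\delta(X_{1:n}) - \hat\theta_n\|^2
\end{aligned} \tag{2.1}$$

where the expectation is with respect to the joint distribution of $\theta \sim \pi(\theta)$ and $X_{1:n} \mid \theta \sim P_\theta$, and
the second equality is due to

$$\begin{aligned}
\mathbb{E}\,\langle \theta - \delta(X_{1:n}),\, \delta(X_{1:n}) - \hat\theta_n \rangle
&= \mathbb{E}\big( \mathbb{E}( \langle \theta - \delta(X_{1:n}),\, \delta(X_{1:n}) - \hat\theta_n \rangle \mid X_{1:n} ) \big) \\
&= \mathbb{E}\big( \langle \mathbb{E}(\theta - \delta(X_{1:n}) \mid X_{1:n}),\, \delta(X_{1:n}) - \hat\theta_n \rangle \big) \\
&= \mathbb{E}\big( \langle 0,\, \delta(X_{1:n}) - \hat\theta_n \rangle \big) = 0,
\end{aligned}$$

using the fact that $\theta \to X_{1:n} \to \hat\theta_n$ forms a Markov chain. The first term in the decomposition (2.1)
is the Bayes risk $R_n(\Theta; \pi)$. The second term can be viewed as the excess risk due to quantization.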
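
The decomposition (2.1) can be checked numerically. The sketch below is a Monte Carlo illustration for a scalar Gaussian prior and Gaussian noise; the variances, the sample size, and the 8-point uniform codebook are all hypothetical choices made for the illustration, not quantities taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n_draws = 200_000
prior_var, noise_var = 1.0, 0.25  # illustrative values

# Draw theta from the prior and one noisy observation of it.
theta = rng.normal(0.0, np.sqrt(prior_var), n_draws)
x = theta + rng.normal(0.0, np.sqrt(noise_var), n_draws)

# Posterior mean of theta given x under the Gaussian prior.
delta = prior_var / (prior_var + noise_var) * x

# Hypothetical 3-bit quantized estimator: nearest point of an 8-point grid.
# It depends on theta only through the data, so theta -> x -> theta_hat is Markov.
codebook = np.linspace(-2.0, 2.0, 8)
nearest = np.abs(delta[:, None] - codebook[None, :]).argmin(axis=1)
theta_hat = codebook[nearest]

total = np.mean((theta - theta_hat) ** 2)   # integrated risk of theta_hat
bayes = np.mean((theta - delta) ** 2)       # Bayes risk
excess = np.mean((delta - theta_hat) ** 2)  # excess risk due to quantization

print(total, bayes + excess)
```

Up to Monte Carlo error, the two printed numbers agree, reflecting the vanishing cross term.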

Let $T_n = T(X_1, \ldots, X_n)$ be a sufficient statistic for $\theta$. The posterior mean can be expressed in
terms of $T_n$ and we will abuse notation and write it as $\delta(T_n)$. Since the quantized estimator $\hat\theta_n$ uses
at most $B_n$ bits, we have

$$B_n \ge H(\hat\theta_n) \ge H(\hat\theta_n) - H(\hat\theta_n \mid \delta(T_n)) = I(\hat\theta_n; \delta(T_n)),$$

where $H$ and $I$ denote the Shannon entropy and mutual information, respectively. Now consider
the optimization

$$\begin{aligned}
\inf_{P(\cdot \mid \delta(T_n))} \quad & \mathbb{E}\|\delta(T_n) - \hat\theta_n\|^2 \\
\text{such that} \quad & I(\hat\theta_n; \delta(T_n)) \le B_n
\end{aligned}$$

where the infimum is over all conditional distributions $P(\hat\theta_n \mid \delta(T_n))$. This parallels the definition
of the distortion rate function, minimizing the distortion under a constraint on mutual information
[12]. Denoting the value of this optimization by $Q_n(\Theta, B_n; \pi)$, we can lower bound the quantized
minimax risk by

$$R_n(\Theta, B_n) \ge R_n(\Theta; \pi) + Q_n(\Theta, B_n; \pi).$$

Since each prior distribution $\pi(\theta)$ supported on $\Theta$ gives a lower bound, we have

$$R_n(\Theta, B_n) \ge \sup_\pi \big[ R_n(\Theta; \pi) + Q_n(\Theta, B_n; \pi) \big]$$

and the goal becomes to obtain a least favorable prior for the quantized risk.

Before turning to the case of quantized estimation over Sobolev spaces, we illustrate this
technique on some simpler, more concrete examples.

Example 2.1 (Normal means in a hypercube). Let $X_i \sim \mathcal{N}(\theta, \sigma^2 I_d)$ for $i = 1, 2, \ldots, n$. Suppose
that $\sigma^2$ is known and $\theta \in [-r, r]^d$ is to be estimated. We choose the prior $\pi(\theta)$ on $\theta$ to be a product
distribution with density


$$\pi(\theta) = \prod_{j=1}^{d} \frac{3}{2r^3} \big( r - |\theta_j| \big)_+^2.$$


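It is shown in [15] that

$$R_n(\Theta; \pi) \ge \frac{\sigma^2 d}{n} \cdot \frac{r^2}{r^2 + 12\sigma^2/n} \ge c_1 \frac{\sigma^2 d}{n}$$

where $c_1 = \frac{r^2}{r^2 + 12\sigma^2}$. Turning to $Q_n(\Theta, B_n; \pi)$, let $T^{(n)} = (T_1^{(n)}, \ldots, T_d^{(n)}) = \mathbb{E}(\theta \mid X_{1:n})$ be the
posterior mean of $\theta$. In fact, by the independence and symmetry among the dimensions, we know
$T_1^{(n)}, \ldots, T_d^{(n)}$ are independently and identically distributed. Denoting by $T_0^{(n)}$ this common
distribution, we have

$$Q_n(\Theta, B_n; \pi) \ge d \cdot q(B_n/d)$$

where $q(B)$ is the distortion rate function for $T_0^{(n)}$, i.e., the value of the following problem

$$\begin{aligned}
\inf_{P(\hat T \mid T_0^{(n)})} \quad & \mathbb{E}\big(T_0^{(n)} - \hat T\big)^2 \\
\text{such that} \quad & I(\hat T; T_0^{(n)}) \le B.
\end{aligned}$$

Now using the Shannon lower bound [8], we get

$$Q_n(\Theta, B_n; \pi) \ge \frac{d}{2\pi e} \cdot 2^{2h(T_0^{(n)})} \cdot 2^{-\frac{2B_n}{d}},$$

where $h$ denotes differential entropy. Note that as $n \to \infty$, $T_0^{(n)}$ converges in distribution to a
coordinate of $\theta$ drawn from the prior, so there exists a constant $c_2$ independent
of $n$ and $d$ such that

$$R_n(\Theta, B_n) \ge c_1 \frac{\sigma^2 d}{n} + c_2\, d\, 2^{-\frac{2B_n}{d}}.$$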
n 

This lower bound intuitively shows the risk is regulated by two factors, the estimation error and
the quantization error; whichever is larger dominates the risk. The scaling behavior of this lower
bound (ignoring constants) can be achieved by first quantizing each of the $d$ intervals $[-r, r]$ using
$B_n/d$ bits each, and then mapping the MLE to its closest codeword.
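
The scheme just described can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the codebook for each coordinate is taken to be the midpoints of equal-width cells partitioning $[-r, r]$, the sample mean is clipped to the cube before encoding, and the function name `quantize_mle` is hypothetical.

```python
import numpy as np


def quantize_mle(samples: np.ndarray, r: float, total_bits: int):
    """Encode the sample mean with total_bits // d bits per coordinate.

    samples has shape (n, d). Returns (index, estimate): the integer code for
    each coordinate (encoder output) and the decoded codewords (decoder output).
    """
    n, d = samples.shape
    bits_per_coord = total_bits // d
    levels = 2 ** bits_per_coord
    width = 2.0 * r / levels  # width of each quantization cell

    # MLE of the mean, projected onto the cube [-r, r]^d (an assumption here).
    mle = np.clip(samples.mean(axis=0), -r, r)

    # Encoder: index of the cell containing each coordinate of the MLE.
    index = np.minimum(np.floor((mle + r) / width).astype(int), levels - 1)

    # Decoder: map each index to the midpoint of its cell, which is the
    # nearest codeword to any point in that cell.
    estimate = -r + (index + 0.5) * width
    return index, estimate


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    theta = np.array([0.3, -0.6, 0.9])
    data = theta + rng.normal(size=(500, 3))
    print(quantize_mle(data, r=1.0, total_bits=18))
```

With this codebook the quantization error per coordinate is at most half a cell width, $r\,2^{-B_n/d}$, which matches the $2^{-2B_n/d}$ scaling of the squared quantization error in the lower bound.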

Example 2.2 (Gaussian sequences in Euclidean balls). In the example shown above, the lower
bound is tight only in terms of the scaling of the key parameters. In some instances, we are able to
find an asymptotically tight lower bound for which we can show achievability of both the rate and
the constants. Estimating the mean vector of a Gaussian sequence with an $\ell_2$ norm constraint on
the mean is one such case, as we showed in previous work [27].

Specifically, let $X_i \sim \mathcal{N}(\theta_i, \sigma_n^2)$ for $i = 1, 2, \ldots, n$, where $\sigma_n^2 = \sigma^2/n$. Suppose that the
parameter $\theta = (\theta_1, \ldots, \theta_n)$ lies in the Euclidean ball $\Theta_n(c) = \{\theta : \sum_{i=1}^n \theta_i^2 \le c^2\}$. Furthermore, suppose
that $B_n = nB$. Then using the prior $\theta_i \sim \mathcal{N}(0, c^2)$ it can be shown that

$$\liminf_{n \to \infty} R_n(\Theta_n(c), B_n) \ge \frac{\sigma^2 c^2}{\sigma^2 + c^2} + \frac{c^4\, 2^{-2B}}{\sigma^2 + c^2}.$$

The asymptotic estimation error $\sigma^2 c^2/(\sigma^2 + c^2)$ is the well-known Pinsker bound for the Euclidean
ball case. As shown in [27], an explicit quantization scheme can be constructed that asymptotically
achieves this lower bound, realizing the smallest possible quantization error $c^4 2^{-2B}/(\sigma^2 + c^2)$ for
a budget of $B_n = nB$ bits.

The Euclidean ball case is clearly relevant to the Sobolev ellipsoid case, but new coding strategies
and proof techniques are required. In particular, as will be made clear in the sequel, we will
use an adaptive allocation of bits across blocks of coefficients, using more bits for blocks that have
larger estimated signal size. Moreover, determination of the optimal constants requires a detailed
analysis of the worst case prior distributions and the solution of a series of variational problems.

3. Quantized estimation over Sobolev spaces 

Recall that the Sobolev space of order $m$ and radius $c$ is defined by

$$W(m,c) = \Big\{ f : [0,1] \to \mathbb{R} \;:\; f^{(m-1)} \text{ is absolutely continuous and } \int_0^1 \big(f^{(m)}(x)\big)^2 dx \le c^2 \Big\}.$$

The periodic Sobolev space is defined by

$$\widetilde{W}(m,c) = \big\{ f \in W(m,c) : f^{(j)}(0) = f^{(j)}(1),\ j = 0, 1, \ldots, m-1 \big\}. \tag{3.1}$$

The white noise model (1.1) is asymptotically equivalent to making $n$ equally spaced observations
along the sample path, $Y_i = f(i/n) + \sigma \epsilon_i$, where $\epsilon_i \sim \mathcal{N}(0,1)$ [4]. In this formulation, the noise
level in the formulation (1.1) scales as $\varepsilon^2 = \sigma^2/n$, and the rate of convergence takes the familiar
form $n^{-\frac{2m}{2m+1}}$ where $n$ is the number of observations.

To carry out quantized estimation we now require an encoder

$$\phi_\varepsilon : \mathbb{R}^{[0,1]} \to \{1, 2, \ldots, 2^{B_\varepsilon}\}$$

which is a function applied to the sample path $X(t)$. The decoding function then takes the form

$$\psi_\varepsilon : \{1, 2, \ldots, 2^{B_\varepsilon}\} \to \mathbb{R}^{[0,1]}$$

and maps the index to a function estimate. As in the previous section, we write the composition
of the encoder and the decoder as $\hat f_\varepsilon = \psi_\varepsilon \circ \phi_\varepsilon$, which we call the quantized estimator. The
communication or storage $C(\hat f_\varepsilon)$ required by this quantized estimator is no more than $B_\varepsilon$ bits.

To recast quantized estimation in terms of an infinite sequence model, let $(\varphi_j)_{j=1}^\infty$ be the
trigonometric basis, and let

$$\theta_j = \int_0^1 \varphi_j(t) f(t)\,dt, \qquad j = 1, 2, \ldots,$$

be the Fourier coefficients. It is well known [22] that $f = \sum_{j=1}^\infty \theta_j \varphi_j$ belongs to $\widetilde{W}(m,c)$ if and
only if the Fourier coefficients $\theta$ belong to the Sobolev ellipsoid defined as

$$\Theta(m,c) = \Big\{ \theta : \sum_{j=1}^\infty a_j^2 \theta_j^2 \le \frac{c^2}{\pi^{2m}} \Big\} \tag{3.2}$$

where

$$a_j = \begin{cases} j^m, & \text{for even } j, \\ (j-1)^m, & \text{for odd } j. \end{cases}$$

Although this is the standard definition of a Sobolev ellipsoid, for the rest of the paper we will set
$a_j = j^m$, $j = 1, 2, \ldots$ for convenience of analysis. All of the results hold for both definitions of
$a_j$. Also note that (3.2) actually gives a more general definition, since $m$ is no longer assumed to
be an integer, as it is in (3.1). Expanding with respect to the same orthonormal basis, the observed
path $X(t)$ is converted into an infinite Gaussian sequence

$$Y_j = \int_0^1 \varphi_j(t)\,dX(t), \qquad j = 1, 2, \ldots,$$

with $Y_j \sim \mathcal{N}(\theta_j, \varepsilon^2)$. For an estimator $(\hat\theta_j)_{j=1}^\infty$ of $(\theta_j)_{j=1}^\infty$, an estimate of $f$ is obtained by

$$\hat f(x) = \sum_{j=1}^\infty \hat\theta_j \varphi_j(x)$$

with squared error $\|\hat f - f\|_2^2 = \|\hat\theta - \theta\|_2^2$. In terms of this standard reduction, the quantized minimax
risk is thus reformulated as

$$R_\varepsilon(m,c,B_\varepsilon) = \inf_{\hat\theta_\varepsilon,\, C(\hat\theta_\varepsilon) \le B_\varepsilon} \; \sup_{\theta \in \Theta(m,c)} \mathbb{E}_\theta \|\theta - \hat\theta_\varepsilon\|_2^2. \tag{3.3}$$

To state our result, we need to define the value of the following variational problem:

$$V_{m,c,d} \triangleq \max_{(\sigma^2, x_0) \in \mathcal{F}(m,c,d)} \; \int_0^{x_0} \frac{\sigma^2(x)}{\sigma^2(x) + 1}\,dx + x_0 \exp\left( \frac{1}{x_0} \int_0^{x_0} \log \frac{\sigma^4(x)}{\sigma^2(x) + 1}\,dx - \frac{2d}{x_0} \right) \tag{3.4}$$

where the feasible set $\mathcal{F}(m,c,d)$ is the collection of decreasing functions $\sigma^2(x)$ and values $x_0$
satisfying

$$\int_0^{x_0} x^{2m} \sigma^2(x)\,dx \le \frac{c^2}{\pi^{2m}},$$

$$\frac{\sigma^4(x)}{\sigma^2(x) + 1} \ge \exp\left( \frac{1}{x_0} \int_0^{x_0} \log \frac{\sigma^4(y)}{\sigma^2(y) + 1}\,dy - \frac{2d}{x_0} \right) \quad \text{for all } x \le x_0.$$

The significance and interpretation of the variational problem will become apparent as we outline
the proof of this result.
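
Theorem 3.1. Let $R_\varepsilon(m,c,B_\varepsilon)$ be defined as in (3.3), for $m > 0$ and $c > 0$.

(i) If $B_\varepsilon \varepsilon^{\frac{2}{2m+1}} \to \infty$ as $\varepsilon \to 0$, then

$$\liminf_{\varepsilon \to 0} \varepsilon^{-\frac{4m}{2m+1}} R_\varepsilon(m,c,B_\varepsilon) \ge P_{m,c}$$

where $P_{m,c}$ is Pinsker's constant defined in (1.2).

(ii) If $B_\varepsilon \varepsilon^{\frac{2}{2m+1}} \to d$ for some constant $d$ as $\varepsilon \to 0$, then

$$\liminf_{\varepsilon \to 0} \varepsilon^{-\frac{4m}{2m+1}} R_\varepsilon(m,c,B_\varepsilon) \ge P_{m,c} + Q_{m,c,d} = V_{m,c,d}$$

where $V_{m,c,d}$ is the value of the variational problem (3.4).

(iii) If $B_\varepsilon \varepsilon^{\frac{2}{2m+1}} \to 0$ and $B_\varepsilon \to \infty$ as $\varepsilon \to 0$, then

$$\liminf_{\varepsilon \to 0} B_\varepsilon^{2m} R_\varepsilon(m,c,B_\varepsilon) \ge \frac{c^2 m^{2m}}{\pi^{2m}}.$$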

In the first regime where the number of bits $B_\varepsilon$ is much greater than $\varepsilon^{-\frac{2}{2m+1}}$, we recover the
same convergence result as in Pinsker's theorem, in terms of both convergence rate and leading
constant. The proof of the lower bound for this regime can directly follow the proof of Pinsker's
theorem, since the set of estimators considered in our minimax framework is a subset of all possible
estimators.

In the second regime where we have "just enough" bits to preserve the rate, we suffer a loss in
terms of the leading constant. In this "Goldilocks regime," the optimal rate $\varepsilon^{\frac{4m}{2m+1}}$ is achieved but
the constant in front of the rate is Pinsker's constant $P_{m,c}$ plus a positive quantity $Q_{m,c,d}$ determined
by the variational problem.

While the solution to this variational problem does not appear to have an explicit form, it can
be computed numerically. We discuss this term at length in the sequel, where we explain the origin
of the variational problem, compute the constant numerically and approximate it from above and
below. The constants $P_{m,c}$ and $Q_{m,c,d}$ are shown graphically in Figure 1. Note that the parameter $d$
can be thought of as the average number of bits per coefficient used by an optimal quantized
estimator, since $\varepsilon^{-\frac{2}{2m+1}}$ is asymptotically the number of coefficients needed to estimate at the classical
minimax rate. As shown in Figure 1, the constant for quantized estimation quickly approaches the
Pinsker constant as $d$ increases; when $d = 3$ the two are already very close.

In the third regime where the communication budget is insufficient for the estimator to achieve
the optimal rate, we obtain a sub-optimal rate which no longer depends explicitly on the noise level
$\varepsilon$ of the model. In this regime, quantization error dominates, and the risk decays at a rate of $B_\varepsilon^{-2m}$
no matter how fast $\varepsilon$ approaches zero, as long as $B_\varepsilon \ll \varepsilon^{-\frac{2}{2m+1}}$. Here the analogue of Pinsker's
constant takes a very simple form.

Fig 1. The constants $P_{m,c} + Q_{m,c,d}$ as a function of quantization level $d$ in the sufficient regime, where $B_\varepsilon \varepsilon^{\frac{2}{2m+1}} \to d$.
The parameter $d$ can be thought of as the average number of bits per coefficient used by an optimal quantized estimator,
because $\varepsilon^{-\frac{2}{2m+1}}$ is asymptotically the number of coefficients needed to estimate at the classical minimax rate. Here
we take $m = 2$ and $c^2/\pi^{2m} = 1$. The curve indicates that with only 2 bits per coefficient, optimal quantized minimax
estimation degrades by less than a factor of 2 in the constant. With 3 bits per coefficient, the constant is very close to
the classical Pinsker constant.


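Proof of Theorem 3.1. Consider a Gaussian prior distribution on $\theta = (\theta_j)_{j=1}^\infty$ with $\theta_j \sim \mathcal{N}(0, \sigma_j^2)$
for $j = 1, 2, \ldots$, in terms of parameters $\sigma^2 = (\sigma_j^2)_{j=1}^\infty$ to be specified later. One requirement for
the variances is

$$\sum_{j=1}^\infty a_j^2 \sigma_j^2 \le \frac{c^2}{\pi^{2m}}.$$

We denote this prior distribution by $\pi(\theta; \sigma^2)$, and show in Section A that it is asymptotically
concentrated on the ellipsoid $\Theta(m,c)$. Under this prior the model is

$$\theta_j \sim \mathcal{N}(0, \sigma_j^2), \qquad Y_j \mid \theta_j \sim \mathcal{N}(\theta_j, \varepsilon^2), \qquad j = 1, 2, \ldots$$

and the marginal distribution of $Y_j$ is thus $\mathcal{N}(0, \sigma_j^2 + \varepsilon^2)$. Following the strategy outlined in
Section 2, let $\delta$ denote the posterior mean of $\theta$ given $Y$ under this prior, and consider the optimization

$$\begin{aligned}
\inf \quad & \mathbb{E}\|\delta - \hat\theta\|^2 \\
\text{such that} \quad & I(\delta; \hat\theta) \le B_\varepsilon
\end{aligned}$$

where the infimum is over all distributions on $\hat\theta$ such that $\theta \to Y \to \hat\theta$ forms a Markov chain. Now,
the posterior mean satisfies $\delta_j = \gamma_j Y_j$ where $\gamma_j = \sigma_j^2/(\sigma_j^2 + \varepsilon^2)$. Note that the Bayes risk under
this prior is

$$\mathbb{E}\|\theta - \delta\|_2^2 = \sum_{j=1}^\infty \frac{\sigma_j^2 \varepsilon^2}{\sigma_j^2 + \varepsilon^2}.$$

Define

$$\mu_j^2 \triangleq \mathbb{E}(\delta_j - \hat\theta_j)^2.$$

Then the classical rate distortion argument [8] gives that

$$I(\delta; \hat\theta) \ge \sum_{j=1}^\infty \frac{1}{2} \log_+ \left( \frac{\gamma_j^2 (\sigma_j^2 + \varepsilon^2)}{\mu_j^2} \right) = \sum_{j=1}^\infty \frac{1}{2} \log_+ \left( \frac{\sigma_j^4}{\mu_j^2 (\sigma_j^2 + \varepsilon^2)} \right),$$

where $\log_+(x) = \max(\log x, 0)$. Therefore, the quantized minimax risk is lower bounded by

$$R_\varepsilon(m,c,B_\varepsilon) = \inf_{\hat\theta_\varepsilon,\, C(\hat\theta_\varepsilon) \le B_\varepsilon} \; \sup_{\theta \in \Theta(m,c)} \mathbb{E}\|\theta - \hat\theta_\varepsilon\|^2 \ge V_\varepsilon(m,c,B_\varepsilon)(1 + o(1))$$

where $V_\varepsilon(m,c,B_\varepsilon)$ is the value of the optimization

$$\begin{aligned}
\max_{\sigma^2} \min_{\mu^2} \quad & \sum_{j=1}^\infty \left( \mu_j^2 + \frac{\sigma_j^2 \varepsilon^2}{\sigma_j^2 + \varepsilon^2} \right) \\
\text{such that} \quad & \sum_{j=1}^\infty \frac{1}{2} \log_+ \left( \frac{\sigma_j^4}{\mu_j^2 (\sigma_j^2 + \varepsilon^2)} \right) \le B_\varepsilon \\
& \sum_{j=1}^\infty a_j^2 \sigma_j^2 \le \frac{c^2}{\pi^{2m}}
\end{aligned} \tag{$\mathcal{P}_1$}$$

and the $(1 + o(1))$ deviation term is analyzed in the supplementary material.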
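Observe that the quantity $V_\varepsilon(m,c,B_\varepsilon)$ can be upper and lower bounded by

$$\max\big\{ R_\varepsilon(m,c),\, Q_\varepsilon(m,c,B_\varepsilon) \big\} \le V_\varepsilon(m,c,B_\varepsilon) \le R_\varepsilon(m,c) + Q_\varepsilon(m,c,B_\varepsilon) \tag{3.5}$$

where the estimation error term $R_\varepsilon(m,c)$ is the value of the optimization

$$\begin{aligned}
\max_{\sigma^2} \quad & \sum_{j=1}^\infty \frac{\sigma_j^2 \varepsilon^2}{\sigma_j^2 + \varepsilon^2} \\
\text{such that} \quad & \sum_{j=1}^\infty a_j^2 \sigma_j^2 \le \frac{c^2}{\pi^{2m}}
\end{aligned} \tag{$\mathcal{R}_1$}$$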
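and the quantization error term $Q_\varepsilon(m,c,B_\varepsilon)$ is the value of the optimization

$$\begin{aligned}
\max_{\sigma^2} \min_{\mu^2} \quad & \sum_{j=1}^\infty \mu_j^2 \\
\text{such that} \quad & \sum_{j=1}^\infty \frac{1}{2} \log_+ \left( \frac{\sigma_j^4}{\mu_j^2 (\sigma_j^2 + \varepsilon^2)} \right) \le B_\varepsilon \\
& \sum_{j=1}^\infty a_j^2 \sigma_j^2 \le \frac{c^2}{\pi^{2m}}.
\end{aligned} \tag{$\mathcal{Q}_1$}$$
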
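The following results specify the leading order asymptotics of these quantities.

Lemma 3.2. As $\varepsilon \to 0$,

$$R_\varepsilon(m,c) = P_{m,c}\, \varepsilon^{\frac{4m}{2m+1}} (1 + o(1)).$$

Lemma 3.3. As $\varepsilon \to 0$,

$$Q_\varepsilon(m,c,B_\varepsilon) \le \frac{c^2 m^{2m}}{\pi^{2m}} B_\varepsilon^{-2m} (1 + o(1)). \tag{3.6}$$

Moreover, if $B_\varepsilon \varepsilon^{\frac{2}{2m+1}} \to 0$ and $B_\varepsilon \to \infty$,

$$Q_\varepsilon(m,c,B_\varepsilon) = \frac{c^2 m^{2m}}{\pi^{2m}} B_\varepsilon^{-2m} (1 + o(1)).$$

This yields the following closed form upper bound.

Corollary 3.4. Suppose that $B_\varepsilon \to \infty$ and $\varepsilon \to 0$. Then

$$V_\varepsilon(m,c,B_\varepsilon) \le \left( P_{m,c}\, \varepsilon^{\frac{4m}{2m+1}} + \frac{c^2 m^{2m}}{\pi^{2m}} B_\varepsilon^{-2m} \right) (1 + o(1)). \tag{3.7}$$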


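In the insufficient regime $B_\varepsilon \varepsilon^{\frac{2}{2m+1}} \to 0$ and $B_\varepsilon \to \infty$ as $\varepsilon \to 0$, equation (3.5) and Lemma 3.3
show that

$$V_\varepsilon(m,c,B_\varepsilon) = \frac{c^2 m^{2m}}{\pi^{2m}} B_\varepsilon^{-2m} (1 + o(1)).$$

Similarly, in the over-sufficient regime $B_\varepsilon \varepsilon^{\frac{2}{2m+1}} \to \infty$ as $\varepsilon \to 0$, we conclude that

$$V_\varepsilon(m,c,B_\varepsilon) = P_{m,c}\, \varepsilon^{\frac{4m}{2m+1}} (1 + o(1)).$$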

We now turn to the sufficient regime $B_\varepsilon \varepsilon^{\frac{2}{2m+1}} \to d$. We begin by making three observations
about the solution to the optimization $(\mathcal{P}_1)$. First, we note that the series $(\sigma_j^2)_{j=1}^\infty$ that solves $(\mathcal{P}_1)$
can be assumed to be decreasing. If $(\sigma_j^2)$ were not in decreasing order, we could rearrange it to
be decreasing, and correspondingly rearrange $(\mu_j^2)$, without violating the constraints or changing
the value of the optimization. Second, we note that given $(\sigma_j^2)$, the optimal $(\mu_j^2)$ is obtained by the
"reverse water-filling" scheme [8]. Specifically, there exists $\eta > 0$ such that

$$\mu_j^2 = \begin{cases} \eta & \text{if } \dfrac{\sigma_j^4}{\sigma_j^2 + \varepsilon^2} \ge \eta, \\[2ex] \dfrac{\sigma_j^4}{\sigma_j^2 + \varepsilon^2} & \text{otherwise,} \end{cases}$$
otherwise, 

where $\eta$ is chosen so that

$$\frac{1}{2} \sum_{j=1}^\infty \log_+ \left( \frac{\sigma_j^4}{\mu_j^2 (\sigma_j^2 + \varepsilon^2)} \right) \le B_\varepsilon.$$
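
The reverse water-filling step described above can be sketched numerically. The following Python function is an illustrative implementation, not the authors' code: it takes the variances $v_j = \sigma_j^4/(\sigma_j^2 + \varepsilon^2)$ of the components to be coded, finds the water level $\eta$ by bisection so that the rate constraint is met with (near) equality, and returns $\mu_j^2 = \min(v_j, \eta)$. The function name is hypothetical, and the budget is assumed to be expressed in the units of the natural logarithm used in the rate formula.

```python
import numpy as np


def reverse_water_filling(source_var, budget):
    """Return (eta, mu2) for component variances source_var and a rate budget.

    The rate at water level eta is 0.5 * sum_j log_+(v_j / eta), which is
    decreasing in eta, so the level meeting the budget is found by bisection.
    """
    v = np.asarray(source_var, dtype=float)

    def rate(eta):
        # Components with v_j <= eta contribute zero rate (log_+).
        return 0.5 * np.sum(np.log(np.maximum(v / eta, 1.0)))

    lo, hi = 0.0, float(v.max())  # rate(hi) == 0, so hi is always feasible
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if rate(mid) > budget:
            lo = mid   # level too low: too many bits used
        else:
            hi = mid   # feasible: try a lower level
    eta = hi
    mu2 = np.minimum(v, eta)  # distortion allotted to each component
    return eta, mu2


if __name__ == "__main__":
    sigma2 = np.array([4.0, 2.0, 1.0, 0.5, 0.1])
    eps2 = 0.25
    v = sigma2 ** 2 / (sigma2 + eps2)
    print(reverse_water_filling(v, budget=3.0))
```

Components whose variance falls below the water level receive no bits at all, which is the mechanism behind the cutoff index $J$ in the third observation.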


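Third, there exists an integer $J > 0$ such that the optimal series $(\sigma_j^2)$ satisfies

$$\frac{\sigma_j^4}{\sigma_j^2 + \varepsilon^2} \ge \eta \ \text{ for } j = 1, \ldots, J \quad \text{and} \quad \sigma_j^2 = 0 \ \text{ for } j > J,$$

where $\eta$ is the "water-filling level" for $(\mu_j^2)$ (see [8]). Using these three observations, the
optimization $(\mathcal{P}_1)$ can be reformulated as

$$\begin{aligned}
\max_{\sigma^2, J} \quad & J\eta + \sum_{j=1}^J \frac{\sigma_j^2 \varepsilon^2}{\sigma_j^2 + \varepsilon^2} \\
\text{such that} \quad & \frac{1}{2} \sum_{j=1}^J \log \left( \frac{\sigma_j^4}{\eta (\sigma_j^2 + \varepsilon^2)} \right) = B_\varepsilon \\
& \sum_{j=1}^J a_j^2 \sigma_j^2 \le \frac{c^2}{\pi^{2m}} \\
& (\sigma_j^2) \text{ is decreasing and } \frac{\sigma_J^4}{\sigma_J^2 + \varepsilon^2} \ge \eta.
\end{aligned} \tag{$\mathcal{P}_2$}$$

To derive the solution to $(\mathcal{P}_2)$, we use a continuous approximation of $\sigma^2$, writing

$$\sigma_j^2 = \sigma^2(jh)\, h^{2m+1}$$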


where $h$ is the bandwidth to be specified and $\sigma^2(\cdot)$ is a function defined on $(0, \infty)$. The constraint
that $\sum_{j=1}^\infty a_j^2 \sigma_j^2 \le \frac{c^2}{\pi^{2m}}$ becomes the integral constraint [18]

$$\int_0^\infty x^{2m} \sigma^2(x)\,dx \le \frac{c^2}{\pi^{2m}}.$$

We now set the bandwidth so that $h^{2m+1} = \varepsilon^2$. This choice of bandwidth will balance the two terms
in the objective function, and thus gives the hardest prior distribution. Applying the above three
observations under this continuous approximation, we transform problem $(\mathcal{P}_2)$ to the following
optimization:

$$\begin{aligned}
\max_{\sigma^2, x_0} \quad & x_0 \eta + \int_0^{x_0} \frac{\sigma^2(x)}{\sigma^2(x) + 1}\,dx \\
\text{such that} \quad & \int_0^{x_0} \frac{1}{2} \log \left( \frac{\sigma^4(x)}{\eta (\sigma^2(x) + 1)} \right) dx = d \\
& \int_0^{x_0} x^{2m} \sigma^2(x)\,dx \le \frac{c^2}{\pi^{2m}} \\
& \sigma^2(x) \text{ is decreasing and } \frac{\sigma^4(x)}{\sigma^2(x) + 1} \ge \eta \ \text{ for all } x \le x_0.
\end{aligned} \tag{$\mathcal{P}_3$}$$

Note that here we omit the convergence rate $h^{2m} = \varepsilon^{\frac{4m}{2m+1}}$ in the objective function. The asymptotic
equivalence between $(\mathcal{P}_2)$ and $(\mathcal{P}_3)$ can be established by a similar argument to Theorem 3.1 in [9].
Solving the first constraint for $\eta$ yields

$$\begin{aligned}
\max_{\sigma^2, x_0} \quad & \int_0^{x_0} \frac{\sigma^2(x)}{\sigma^2(x) + 1}\,dx + x_0 \exp\left( \frac{1}{x_0} \int_0^{x_0} \log \frac{\sigma^4(x)}{\sigma^2(x) + 1}\,dx - \frac{2d}{x_0} \right) \\
\text{such that} \quad & \int_0^{x_0} x^{2m} \sigma^2(x)\,dx \le \frac{c^2}{\pi^{2m}} \\
& \sigma^2(x) \text{ is decreasing} \\
& \frac{\sigma^4(x)}{\sigma^2(x) + 1} \ge \exp\left( \frac{1}{x_0} \int_0^{x_0} \log \frac{\sigma^4(y)}{\sigma^2(y) + 1}\,dy - \frac{2d}{x_0} \right) \ \text{ for all } x \le x_0.
\end{aligned} \tag{$\mathcal{P}_4$}$$


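The following is proved using a variational argument in the supplementary material.

Lemma 3.5. The solution to $(\mathcal{P}_4)$ satisfies

$$\frac{1}{(\sigma^2(x) + 1)^2} + \exp\left( \frac{1}{x_0} \int_0^{x_0} \log \frac{\sigma^4(y)}{\sigma^2(y) + 1}\,dy - \frac{2d}{x_0} \right) \frac{\sigma^2(x) + 2}{\sigma^2(x)(\sigma^2(x) + 1)} = \lambda x^{2m}$$

for some $\lambda > 0$.

Fixing $x_0$, the lemma shows that by setting

$$\alpha = \exp\left( \frac{1}{x_0} \int_0^{x_0} \log \frac{\sigma^4(y)}{\sigma^2(y) + 1}\,dy - \frac{2d}{x_0} \right)$$

we can express $\sigma^2(x)$ implicitly as the unique positive root of a third-order polynomial in $y$,

$$\lambda x^{2m} y^3 + (2\lambda x^{2m} - \alpha) y^2 + (\lambda x^{2m} - 3\alpha - 1) y - 2\alpha.$$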


This leads us to an explicit form of $\sigma^2(x)$ for a given value of $\alpha$. However, note that $\alpha$ still depends on $\sigma^2(x)$ and $x_0$, so the solution $\sigma^2(x)$ might not be compatible with $\alpha$ and $x_0$. We can either search through a grid of values of $\alpha$ and $x_0$, or, more efficiently, use an iterative method to find the pair of values that gives us the solution. We omit the details on how to calculate the values of the optimization, as it is not the main purpose of the paper.
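As an illustration of the root-finding step only, the following is a minimal sketch that computes the positive root of the cubic for given values of $\lambda$, $\alpha$, $m$ and $x$ by bisection. The function names and the numerical values in the usage note are hypothetical; the outer search for a compatible pair $(\alpha, x_0)$ is not attempted here.

```python
def sigma2_from_cubic(x, lam, alpha, m, tol=1e-12):
    """Positive root y of
    lam*x^(2m)*y^3 + (2*lam*x^(2m) - alpha)*y^2 + (lam*x^(2m) - 3*alpha - 1)*y - 2*alpha.
    Assumes lam > 0, alpha > 0 and x > 0, so the polynomial is negative at 0
    and positive for large y."""
    a = lam * x ** (2 * m)

    def p(y):
        return a * y**3 + (2 * a - alpha) * y**2 + (a - 3 * alpha - 1) * y - 2 * alpha

    lo, hi = 0.0, 1.0
    while p(hi) <= 0:          # expand until the sign changes
        hi *= 2.0
    while hi - lo > tol * max(1.0, hi):
        mid = 0.5 * (lo + hi)
        if p(mid) > 0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)


def stationarity_lhs(y, alpha):
    """Left-hand side of the condition in Lemma 3.5, with y = sigma^2(x)."""
    return 1.0 / (y + 1.0) ** 2 + alpha * (y + 2.0) / (y * (y + 1.0))
```

For example, with the arbitrary values `lam=2.0`, `alpha=0.3`, `m=1` and `x=0.5`, the returned root `y` satisfies `stationarity_lhs(y, 0.3)` approximately equal to `lam * x**2`, which is the identity of Lemma 3.5 at that point.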

To summarize, in the regime $B_\varepsilon\varepsilon^{\frac{2}{2m+1}} \to d$ as $\varepsilon \to 0$, we obtain

$$V_\varepsilon(m, c, B_\varepsilon) = (P_{m,c} + Q_{m,c,d})(1 + o(1)),$$

where we denote by $P_{m,c} + Q_{m,c,d}$ the value of the optimization $(\mathcal{P}_4)$.

4. Achievability

In this section, we show that the lower bounds in Theorem 3.1 are achievable by a quantized estimator using a random coding scheme. The basic idea of our quantized estimation procedure is to conduct blockwise estimation and quantization together, using a quantized form of the James-Stein estimator.

Before we present a quantized form of the James-Stein estimator, let us first consider a class of simple procedures. Suppose that $\hat\theta = \hat\theta(X)$ is an estimator of $\theta \in \Theta(m, c)$ without quantization. We assume that $\hat\theta \in \Theta(m, c)$, as projection always reduces mean squared error. To design a $B$-bit quantized estimator, let $\check\Theta$ be the optimal $\delta$-covering of the parameter space $\Theta(m, c)$ such that $|\check\Theta| \le 2^B$, that is,

$$\delta = \delta(B) = \inf_{\check\Theta \subset \Theta:\ |\check\Theta| \le 2^B}\ \sup_{\theta \in \Theta}\ \inf_{\theta' \in \check\Theta} \|\theta - \theta'\|.$$

The quantized estimator is then defined to be

$$\check\theta = \check\theta(X) = \arg\min_{\theta' \in \check\Theta} \|\hat\theta(X) - \theta'\|.$$

Now the mean squared error satisfies

$$\mathbb{E}_\theta\|\check\theta - \theta\|^2 = \mathbb{E}_\theta\|\check\theta - \hat\theta + \hat\theta - \theta\|^2 \le 2\mathbb{E}_\theta\|\check\theta - \hat\theta\|^2 + 2\mathbb{E}_\theta\|\hat\theta - \theta\|^2 \le 2\sup_{\theta'}\mathbb{E}_{\theta'}\|\hat\theta - \theta'\|^2 + 2\delta(B)^2.$$

If we pick $\hat\theta$ to be a minimax estimator for $\Theta$, the first term above gives the minimax risk for estimating $\theta$ in the parameter space $\Theta$. The second term is closely related to the metric entropy of the parameter space $\Theta(m, c)$. In fact, for the Sobolev ellipsoid $\Theta(m, c)$, it is shown in [9] that $\delta(B)^2 = B^{-2m}(1 + o(1))$ as $B \to \infty$. Thus, with an extra constant factor of 2, the mean squared error of this quantized estimator is decomposed into the minimax risk for $\Theta$ and an error term due to quantization. This procedure does not achieve the exact lower bound of the minimax risk for the constrained estimation problem, and moreover it is not clear how such an $\varepsilon$-net can be generated. In what follows we will describe a quantized estimation procedure that we will show achieves the lower bound with the exact constants, and that also adapts to the unknown parameters of the Sobolev space.

We begin by defining the block system to be used, which is usually referred to as the weakly geometric system of blocks [22]. Let $N_\varepsilon = \lfloor 1/\varepsilon^2 \rfloor$ and $\rho_\varepsilon = (\log(1/\varepsilon))^{-1}$. Let $J_1, \dots, J_K$ be a partition of the set $\{1, \dots, N_\varepsilon\}$ such that

$$\bigcup_{k=1}^{K} J_k = \{1, \dots, N_\varepsilon\}, \qquad J_{k_1} \cap J_{k_2} = \emptyset \ \text{for } k_1 \ne k_2,$$

and $\min\{j : j \in J_k\} > \max\{j : j \in J_{k-1}\}$.

Let $T_k$ be the cardinality of the $k$th block and suppose that $T_1, \dots, T_K$ satisfy

$$
\begin{aligned}
T_1 &= \lceil \rho_\varepsilon^{-1} \rceil = \lceil \log(1/\varepsilon) \rceil, \\
T_2 &= \lfloor T_1(1 + \rho_\varepsilon) \rfloor, \\
&\ \ \vdots \\
T_{K-1} &= \lfloor T_1(1 + \rho_\varepsilon)^{K-2} \rfloor, \\
T_K &= N_\varepsilon - \sum_{k=1}^{K-1} T_k.
\end{aligned}
\qquad (4.1)
$$

Then $K \le C\log^2(1/\varepsilon)$ (see Lemma A.4). For an infinite sequence $x \in \ell^2$, denote by $x_{(k)}$ the vector $(x_j)_{j \in J_k} \in \mathbb{R}^{T_k}$. We also write $j_k = \sum_{l=1}^{k-1} T_l + 1$, which is the smallest index in block $J_k$. The weakly geometric system of blocks is defined such that the size of the blocks does not grow too quickly (the ratio between the sizes of two neighboring blocks goes to 1 asymptotically), and such that the number of blocks is on a logarithmic scale with respect to $1/\varepsilon$ ($K \lesssim \log^2(1/\varepsilon)$). See Lemma A.4.
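The block sizes in (4.1) can be generated with a short loop. The following is a minimal sketch; the function name is hypothetical, and the stopping rule (the last block takes whatever remains once the next geometric block would reach or exceed $N_\varepsilon$) is an assumption about how $K$ is determined, since (4.1) only states that $T_K$ is the remainder.

```python
import math


def weakly_geometric_blocks(eps):
    """Block sizes T_1, ..., T_K for the weakly geometric system (4.1).
    Assumes 0 < eps < 1/e so that log(1/eps) > 1."""
    n_eps = math.floor(1.0 / eps**2)
    rho = 1.0 / math.log(1.0 / eps)
    t1 = math.ceil(math.log(1.0 / eps))
    sizes, total, k = [], 0, 0
    while True:
        t = math.floor(t1 * (1.0 + rho) ** k)
        if total + t >= n_eps:
            # Last block: the remainder N_eps - sum of the previous sizes.
            sizes.append(n_eps - total)
            return sizes
        sizes.append(t)
        total += t
        k += 1
```

With `eps = 0.01`, the first block has size $\lceil \log 100 \rceil = 5$, the sizes sum to $N_\varepsilon$, and consecutive ratios among the geometric blocks stay below $1 + 3\rho_\varepsilon$, in line with Lemma A.4.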

We are now ready to describe the quantized estimation scheme. We first give a high-level description of the scheme, and then the precise specification. In contrast to rate distortion theory, where the codebook and allocation of the bits are determined once the source distribution is known, here the codebook and allocation of bits are adaptive to the data: more bits are used for blocks having larger signal size. The first step in our quantization scheme is to construct a “base code” of $2^B$ randomly generated vectors of maximum block length $T_K$, with $\mathcal{N}(0,1)$ entries. The base code is thought of as a $2^B \times T_K$ random matrix $Z$; it is generated before observing any data, and is shared between the sender and receiver. After observing data $(Y_j)$, the rows of $Z$ are apportioned to different blocks $k = 1, \dots, K$, with more rows being used for blocks having larger estimated signal size. To do so, the norm $\|Y_{(k)}\|$ of each block $k$ is first quantized as a discrete value $\check S_k$. A subcodebook $\mathcal{Z}_k$ is then constructed by normalizing the appropriate rows and the first $T_k$ columns of the base code, yielding a collection of random points on the unit sphere $\mathbb{S}^{T_k - 1}$. To form a quantized estimate of the coefficients in the block, the codeword $\check Z_{(k)} \in \mathcal{Z}_k$ having the smallest angle to $Y_{(k)}$ is then found. The appropriate indices are then transmitted to the receiver.

To decode and reconstruct the quantized estimate, the receiver first recovers the quantized norms $(\check S_k)$, which enables reconstruction of the subdivision of the base code that was used by the encoder. After extracting for each block $k$ the appropriate row of the base code, the codeword $\check Z_{(k)}$ is reconstructed, and a James-Stein type estimator is then calculated.

The quantized estimation scheme is detailed below. 

Step 1. Base code generation.

1.1. Generate codebook $\mathcal{S}_k = \{\sqrt{T_k\varepsilon^2} + i\varepsilon^2 : i = 0, 1, \dots, s_k\}$ where $s_k = \lceil \varepsilon^{-2} c (j_k\pi)^{-m} \rceil$, for $k = 1, \dots, K$.

1.2. Generate base code $Z$, a $2^B \times T_K$ matrix with i.i.d. $\mathcal{N}(0,1)$ entries.

$(\mathcal{S}_k)$ and $Z$ are shared between the encoder and the decoder, before seeing any data.

Step 2. Encoding.

2.1. Encoding block radius. For $k = 1, \dots, K$, encode $\check S_k = \arg\min\{|s - S_k| : s \in \mathcal{S}_k\}$ where

$$
S_k = \begin{cases}
\sqrt{T_k\varepsilon^2} & \text{if } \|Y_{(k)}\| < \sqrt{T_k\varepsilon^2}, \\
\sqrt{T_k\varepsilon^2} + c(j_k\pi)^{-m} & \text{if } \|Y_{(k)}\| > \sqrt{T_k\varepsilon^2} + c(j_k\pi)^{-m}, \\
\|Y_{(k)}\| & \text{otherwise.}
\end{cases}
$$


2.2. Allocation of bits. Let $(\tilde b_k)_{k=1}^{K}$ be the solution to the optimization

$$
\min_{b}\ \sum_{k=1}^{K} \frac{(\check S_k^2 - T_k\varepsilon^2)^2}{\check S_k^2}\cdot 2^{-2b_k}
\quad \text{such that } \sum_{k=1}^{K} T_k b_k \le B,\ b_k \ge 0.
\qquad (4.2)
$$


2.3. Encoding block direction. Form the data-dependent codebook as follows.

Divide the rows of $Z$ into blocks of sizes $2^{\lceil T_1\tilde b_1 \rceil}, \dots, 2^{\lceil T_K\tilde b_K \rceil}$. Based on the $k$th block of rows, construct the data-dependent codebook $\mathcal{Z}_k$ by keeping only the first $T_k$ entries and normalizing each truncated row; specifically, the $j$th row of $\mathcal{Z}_k$ is given by

$$\mathcal{Z}_{k,j} = \frac{Z_{i,1:T_k}}{\|Z_{i,1:T_k}\|} \in \mathbb{S}^{T_k - 1},$$

where $i$ is the appropriate row of the base code $Z$ and $Z_{i,1:t}$ denotes the first $t$ entries of the row vector. A graphical illustration is shown below in Figure 2.

With this data-dependent codebook, encode

$$\check Z_{(k)} = \arg\max\{\langle z, Y_{(k)} \rangle : z \in \mathcal{Z}_k\}$$

for $k = 1, \dots, K$.

Fig 2. An illustration of the data-dependent codebook. The big matrix represents the base code $Z$, and the shaded areas are $(\mathcal{Z}_k)$, sub-matrices of size $T_k \times 2^{\lceil T_k\tilde b_k \rceil}$ with rows normalized.

Step 3. Transmission. Transmit or store $(\check S_k)_{k=1}^{K}$ and $(\check Z_{(k)})_{k=1}^{K}$ by their corresponding indices.

Step 4. Decoding & Estimation.

4.1. Recover $(\check S_k)$ based on the transmitted or stored indices and the common codebook $(\mathcal{S}_k)$.

4.2. Solve (4.2) and get $(\tilde b_k)$. Reconstruct $(\mathcal{Z}_k)$ using $Z$ and $(\tilde b_k)$.

4.3. Recover $(\check Z_{(k)})$ based on the transmitted or stored indices and the reconstructed codebook $(\mathcal{Z}_k)$.

4.4. Estimate $\theta_{(k)}$ by

$$\check\theta_{(k)} = \frac{\check S_k^2 - T_k\varepsilon^2}{\check S_k}\sqrt{1 - 2^{-2\tilde b_k}}\cdot \check Z_{(k)}.$$

4.5. Estimate the entire vector $\theta$ by concatenating the $\check\theta_{(k)}$ vectors and padding with zeros; thus,

$$\check\theta = (\check\theta_{(1)}, \dots, \check\theta_{(K)}, 0, 0, \dots).$$
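To make the direction-coding step concrete, here is a minimal single-block sketch of Steps 2.3 and 4.4. It is a simplification under stated assumptions, not the full scheme: the radius is only clipped from below and is not quantized on the grid of Step 1.1, the bit rate `bits_per_coord` is supplied by the caller instead of being obtained from (4.2), and the shared base code is emulated by regenerating Gaussian rows from a common `seed`. All function names are hypothetical.

```python
import math
import random


def _unit_codebook(num_codewords, dim, seed):
    """Regenerate the shared codebook: Gaussian rows scaled to unit length."""
    rng = random.Random(seed)
    rows = []
    for _ in range(num_codewords):
        z = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(v * v for v in z))
        rows.append([v / norm for v in z])
    return rows


def encode_block(y, eps, bits_per_coord, seed):
    """Return (clipped norm, index of the codeword with the largest inner product)."""
    dim = len(y)
    norm = math.sqrt(sum(v * v for v in y))
    s = max(norm, math.sqrt(dim) * eps)  # lower clipping only (assumption)
    book = _unit_codebook(2 ** math.ceil(dim * bits_per_coord), dim, seed)
    index = max(range(len(book)),
                key=lambda i: sum(c * v for c, v in zip(book[i], y)))
    return s, index


def decode_block(s, index, dim, eps, bits_per_coord, seed):
    """James-Stein type reconstruction from the norm and the codeword index."""
    book = _unit_codebook(2 ** math.ceil(dim * bits_per_coord), dim, seed)
    shrink = (s * s - dim * eps * eps) / s
    scale = shrink * math.sqrt(1.0 - 2.0 ** (-2.0 * bits_per_coord))
    return [scale * c for c in book[index]]
```

The encoder and decoder agree because both rebuild the same codebook from the seed, so only the norm and the index need to be communicated. A block whose norm does not exceed $\sqrt{T_k\varepsilon^2}$ is shrunk to zero, as in the unquantized James-Stein estimator.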

The following theorem establishes the asymptotic optimality of this quantized estimator. 

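Theorem 4.1. Let $\check\theta$ be the quantized estimator defined above.

(i) If $B\varepsilon^{\frac{2}{2m+1}} \to \infty$, then

$$\lim_{\varepsilon \to 0} \varepsilon^{-\frac{4m}{2m+1}} \sup_{\theta \in \Theta(m,c)} \mathbb{E}\|\theta - \check\theta\|^2 = P_{m,c}.$$

(ii) If $B\varepsilon^{\frac{2}{2m+1}} \to d$ for some constant $d$ as $\varepsilon \to 0$, then

$$\lim_{\varepsilon \to 0} \varepsilon^{-\frac{4m}{2m+1}} \sup_{\theta \in \Theta(m,c)} \mathbb{E}\|\theta - \check\theta\|^2 = P_{m,c} + Q_{d,m,c}.$$

(iii) If $B\varepsilon^{\frac{2}{2m+1}} \to 0$ and $B(\log(1/\varepsilon))^{-3} \to \infty$, then

$$\lim_{\varepsilon \to 0} B^{2m} \sup_{\theta \in \Theta(m,c)} \mathbb{E}\|\theta - \check\theta\|^2 = \frac{c^2 m^{2m}}{\pi^{2m}}.$$

The expectations are with respect to the random quantized estimation scheme $Q$ and the distribution of the data.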

We pause to make several remarks on this result before outlining the proof. 

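Remark 4.1. The total number of bits used by this quantized estimation scheme is

$$
\begin{aligned}
\sum_{k=1}^{K} \lceil T_k\tilde b_k \rceil + \sum_{k=1}^{K} \log\lceil \varepsilon^{-2} c (j_k\pi)^{-m} \rceil
&\le \sum_{k=1}^{K} \lceil T_k\tilde b_k \rceil + \sum_{k=1}^{K} \log\lceil \varepsilon^{-2} c \rceil \\
&\le B + K + 2K\rho_\varepsilon^{-1} + K\log\lceil c \rceil \\
&= B + O((\log(1/\varepsilon))^3),
\end{aligned}
$$

where we use the fact that $K \lesssim \log^2(1/\varepsilon)$ (see Lemma A.4). Therefore, as long as $B(\log(1/\varepsilon))^{-3} \to \infty$, the total number of bits used is asymptotically no more than $B$, the given communication budget.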
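The allocation (4.2), whose cost is counted in this remark, is a convex problem with a water-filling structure: the stationarity condition gives $b_k = \max\{0, \tfrac12\log_2(w_k/(T_k\nu))\}$ with $w_k = (\check S_k^2 - T_k\varepsilon^2)^2/\check S_k^2$, and the level $\nu$ is chosen so that the budget is met. The following is a minimal sketch of this idea using bisection on $\nu$; the function name and the choice of bisection are assumptions, and the scheme itself does not prescribe a particular solver.

```python
import math


def allocate_bits(weights, sizes, budget, iters=200):
    """Minimize sum_k w_k * 2**(-2*b_k) subject to sum_k T_k*b_k <= budget, b_k >= 0."""

    def bits(nu):
        return [max(0.0, 0.5 * math.log2(w / (t * nu))) if w > 0 else 0.0
                for w, t in zip(weights, sizes)]

    def used(nu):
        return sum(t * b for t, b in zip(sizes, bits(nu)))

    hi = max(w / t for w, t in zip(weights, sizes))  # at nu = hi every b_k is 0
    if hi <= 0 or budget <= 0:
        return [0.0] * len(sizes)
    lo = hi
    while used(lo) < budget:       # lower the level until the budget is reached
        lo /= 2.0
    for _ in range(iters):         # bisection on a log scale, keeping used(hi) <= budget
        mid = math.sqrt(lo * hi)
        if used(mid) > budget:
            lo = mid
        else:
            hi = mid
    return bits(hi)
```

For instance, with weights `[4.0, 1.0]`, unit block sizes and a budget of 2 bits, the level is $\nu = 1/2$ and the allocation is `[1.5, 0.5]`; with a budget of 0.5 bits, only the first block receives bits.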

Remark 4.2. The quantized estimation scheme does not make essential use of the parameters of the Sobolev space, namely the smoothness $m$ and the radius $c$. The only exception is that in Step 1.1 the size of the codebook $\mathcal{S}_k$ depends on $m$ and $c$. However, suppose that we know a lower bound on the smoothness $m$, say $m \ge m_0$, and an upper bound on the radius $c$, say $c \le c_0$. By replacing $m$ and $c$ by $m_0$ and $c_0$ respectively, we make the codebook independent of the parameters. We shall assume $m_0 > 1/2$, which leads to continuous functions. This modification does not, however, significantly increase the number of bits; in fact, the total number of bits is still $B + O(\rho_\varepsilon^{-3})$. Thus, we can easily make this quantized estimator minimax adaptive to the class of Sobolev ellipsoids $\{\Theta(m, c) : m \ge m_0, c \le c_0\}$, as long as $B$ grows faster than $(\log(1/\varepsilon))^3$. More formally, we have

Corollary 4.2. Suppose that $B_\varepsilon$ satisfies $B_\varepsilon(\log(1/\varepsilon))^{-3} \to \infty$. Let $\check\theta'$ be the quantized estimator with the modification described above, which does not assume knowledge of $m$ and $c$. Then for $m \ge m_0$ and $c \le c_0$,

$$\lim_{\varepsilon \to 0} \frac{\sup_{\theta \in \Theta(m,c)} \mathbb{E}\|\theta - \check\theta'\|^2}{\inf_{\hat\theta,\, C(\hat\theta) \le B_\varepsilon} \sup_{\theta \in \Theta(m,c)} \mathbb{E}\|\theta - \hat\theta\|^2} = 1,$$

where the expectation in the numerator is with respect to the data and the randomized coding scheme, while the expectation in the denominator is only with respect to the data.

Remark 4.3. When $B$ grows at a rate comparable to or slower than $(\log(1/\varepsilon))^3$, the lower bound is still achievable, just no longer by the quantized estimator we described above. The main reason is that when $B$ does not grow faster than $(\log(1/\varepsilon))^3$, the block size $T_1 = \lceil \log(1/\varepsilon) \rceil$ is too large. The blocking needs to be modified to get achievability in this case.

Remark 4.4. In classical rate distortion [8, 12], the probabilistic method applied to a randomized coding scheme shows the existence of a code achieving the rate distortion bounds. Comparing to Theorem 3.1, we see that the expected risk, averaged over the randomness in the codebook, similarly achieves the quantized minimax lower bound. However, note that the average over the codebook is inside the supremum over the Sobolev space, implying that the code achieving the bound may vary over the ellipsoid. In other words, while the coding scheme generates a codebook that is used for different $\theta$, it is not known whether there is one code generated by this randomized scheme that is “universal,” and achieves the risk lower bound with high probability over the ellipsoid. The existence or non-existence of such “universal codes” is an interesting direction for further study.

Remark 4.5. We have so far dealt with the periodic case, i.e., functions in the periodic Sobolev space $\widetilde W(m, c)$ defined in (3.1). For the Sobolev space $W(m, c)$, where the functions are not necessarily periodic, the lower bound given in Theorem 3.1 still holds, since $\widetilde W(m, c)$ is a subset of the larger class $W(m, c)$. To extend the achievability result to $W(m, c)$, we again need to relate $W(m, c)$ to an ellipsoid. Nussbaum [19] shows using spline theory that the non-periodic space can actually be expressed as an ellipsoid, where the length of the $j$th principal axis scales as $(\pi^2 j)^m$ asymptotically. Based on this link between $W(m, c)$ and the ellipsoid, the techniques used here to show achievability apply, and since the principal axes scale as in the periodic case, the convergence rates remain the same.

Proof of Theorem 4.1. We now sketch the proof of Theorem 4.1, deferring the full details to Section A. To provide only an informal outline of the proof, we shall write $A_1 \sim A_2$ as a shorthand for $A_1 = A_2(1 + o(1))$, and $A_1 \lesssim A_2$ for $A_1 \le A_2(1 + o(1))$, without specifying here what these $o(1)$ terms are.

To upper bound the risk $\mathbb{E}\|\check\theta - \theta\|^2$, we adopt the following sequence of approximations and inequalities. First, we discard the components whose index is greater than $N$ and show that

The proof is then completed by Lemma A.9 showing that the last quantity is equal to $V_\varepsilon(m, c, B)$.

5. Simulations 

Here we illustrate the performance of the proposed quantized estimation scheme. We use the function

$$f(t) = \sqrt{t(1-t)}\,\sin\left(\frac{2.1\pi}{t + 0.3}\right), \qquad 0 \le t \le 1,$$
which we shall refer to as the “damped Doppler function,” shown in Figure 3 (the gray lines). Note that the value 0.3 differs from the value 0.05 in the usual Doppler function used to illustrate spatial adaptation of methods such as wavelets. Since we do not address spatial adaptivity in this paper, we “slow” the oscillations of the Doppler function near zero in our illustrations.

Fig 3. The damped Doppler function (solid gray) and typical realizations of the estimators under different noise levels (n = 500, 5000, and 50000). Three estimators are used: the blockwise James-Stein estimator (dashed black), and two quantized estimators with budgets of 5 bits (dashed red) and 30 bits (dashed blue).


We use this $f$ as the underlying true mean function and generate our data according to the corresponding white noise model (1.1),

$$dX(t) = f(t)\,dt + \varepsilon\,dW(t), \qquad 0 \le t \le 1.$$

We apply the blockwise James-Stein estimator, as well as the proposed quantized estimator with different communication budgets. We also vary the noise level $\varepsilon$ and, equivalently, the effective sample size $n = 1/\varepsilon^2$.
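For reference, the unquantized baseline can be sketched in a few lines. The code below implements the standard positive-part blockwise James-Stein rule $\hat\theta_{(k)} = (1 - T_k\varepsilon^2/\|Y_{(k)}\|^2)_+ Y_{(k)}$ on a vector of noisy coefficients; it is assumed, not confirmed by the text, that this is the exact form used in the simulations, and the function name is hypothetical.

```python
def blockwise_james_stein(y, block_sizes, eps):
    """Shrink each block of noisy coefficients toward zero by the
    positive-part James-Stein factor (1 - T*eps^2 / ||block||^2)_+ ."""
    estimate = []
    start = 0
    for size in block_sizes:
        block = y[start:start + size]
        norm_sq = sum(v * v for v in block)
        if norm_sq > 0:
            shrink = max(0.0, 1.0 - size * eps * eps / norm_sq)
        else:
            shrink = 0.0
        estimate.extend(shrink * v for v in block)
        start += size
    return estimate
```

A block whose energy is at or below the noise level $T_k\varepsilon^2$ is set to zero, while a block with large energy is left almost unchanged; the quantized estimator of Section 4 applies the same shrinkage to a coded direction.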

We first show in Figure 3 some typical realizations of these estimators on data generated under different noise levels (n = 500, 5000, and 50000 respectively). To keep the plots succinct, we show only the true function, the blockwise James-Stein estimates and quantized estimates using total bit budgets of 5 and 30 bits. We observe, in the first plot, that both quantized estimates deviate from the true function, and so does the blockwise James-Stein estimate. This is when the noise is relatively large and any quantized estimate performs poorly, no matter how large a budget is given. Both 5 bits and 30 bits appear to be “sufficient/over-sufficient” here. In the second plot, the blockwise James-Stein estimate is close to the quantized estimate with a budget of 30 bits, while the quantized estimate with a budget of 5 bits fails to capture the fluctuations of the true function. Thus, a budget of 30 bits is still “sufficient,” but 5 bits apparently becomes “insufficient.” In the third plot, the blockwise James-Stein estimate gives a better fit than the two quantized estimates, as both budgets become “insufficient” to achieve the optimal risk.

Fig 4. Risk versus effective sample size $n = 1/\varepsilon^2$ for estimating the damped Doppler function with different estimators. The dashed line represents the risk of the blockwise James-Stein estimator, and the solid ones are for the quantized estimators with different budgets. The budgets are 5, 10, 15, 20, 25, and 30 bits, corresponding to the lines from top to bottom. The two plots are the same curves on the original scale and the log-log scale.

Next, in Figure 4 we plot the risk as a function of sample size n, averaging over 2000 simulations. Note that the bottom plot is just the first plot on a log-log scale. In this set of plots, we are able to observe the phase transition for the quantized estimators. For relatively small values of n, all quantized estimators yield a similar error rate, with risks that are close to (or even smaller than) that of the blockwise James-Stein estimator. This is the over-sufficient regime: even the smallest budget suffices to achieve the optimal risk. As n increases, the curves start to separate, with estimators having smaller bit budgets leading to worse risks compared to the blockwise James-Stein estimator, and compared to estimators with larger budgets. This can be seen as the sufficient regime for the small-budget estimators: the risks are still going down, but at a slower rate than optimal. The six quantized estimators all end up in the insufficient regime: as n increases, their risks begin to flatten out, while the risk of the blockwise James-Stein estimator continues to decrease.

6. Related work and future directions 

Concepts related to quantized nonparametric estimation appear in multiple communities. As mentioned in the introduction, Donoho’s 1997 Wald Lectures (on the eve of the 50th anniversary of Shannon’s 1948 paper) drew sharp parallels between rate distortion, metric entropy and minimax rates, focusing on the same Sobolev function spaces we treat here. One view of the present work is that we take this correspondence further by studying how the risk continuously degrades with the level of quantization. We have analyzed the precise leading order asymptotics for quantized regression over the Sobolev spaces, showing that these rates and constants are realized with coding schemes that are adaptive to the smoothness m and radius c of the ellipsoid, achieving automatically the optimal rate for the regime corresponding to those parameters given the specified communication budget. Our detailed analysis is possible due to what Nussbaum [18] calls the “Pinsker phenomenon,” referring to the fact that linear filters attain the minimax rate in the over-sufficient regime. It will be interesting to study quantized nonparametric estimation in cases where the Pinsker phenomenon does not hold, for example over Besov bodies and different $L_p$ spaces.

Many problems of rate distortion type are similar to quantized regression. The standard “reverse water filling” construction to quantize a Gaussian source with varying noise levels plays a key role in our analysis, as shown in Section 3. In our case the Sobolev ellipsoid is an infinite Gaussian sequence model, requiring truncation of the sequence at the appropriate level depending on the targeted quantization and estimation error. In the case of Euclidean balls, Draper and Wornell [10] study rate distortion problems motivated by communication in sensor networks; this is closely related to the problem of quantized minimax estimation over Euclidean balls that we analyzed in [27]. The essential difference between rate distortion and our quantized minimax framework is that in rate distortion the quantization is carried out for a random source, while in quantized estimation we quantize our estimate of the deterministic and unknown basis coefficients. Since linear estimators are asymptotically minimax for Sobolev spaces under squared error (the “Pinsker phenomenon”), this naturally leads to an alternative view of quantizing the observations, or said differently, of compressing the data before estimation.

Statistical estimation from compressed data has appeared previously in different communities. In [26] a procedure is analyzed that compresses data by random linear transformations in the setting of sparse linear regression. Zhang and Berger [25] study estimation problems when the data are communicated from multiple sources; Ahlswede and Csiszár [2] consider testing problems under communication constraints; the use of side information is studied by Ahlswede and Burnashev [1]; other formulations in terms of multiterminal information theory are given by Han and Amari [14]; nonparametric problems are considered by Raginsky in [20]. In a distributed setting the data may be divided across different compute nodes, with distributed estimates then aggregated or pooled by communicating with a central node. The general “CEO problem” of distributed estimation was introduced by Berger, Zhang and Viswanathan [3], and has been recently studied in parametric settings in [13, 24]. These papers take the view that the data are communicated to the statistician at a certain rate, which may introduce distortion, and the goal is to study the degradation of the estimation error. In contrast, in our setting we can view the unquantized data as being fully available to the statistician at the time of estimation, with communication constraints being imposed when communicating the estimated model to a remote location.

Finally, our quantized minimax analysis shows achievability using random coding schemes, 
which are not computationally efficient. A natural problem is to develop practical coding schemes 
that come close to the quantized minimax lower bounds. In our view, the most promising approach 
currently is to exploit source coding schemes based on greedy sparse regression [23], applying 
such techniques blockwise according to the procedure we developed in Section 4.

Appendix A: Proofs of Technical Results

In this section, we provide proofs for Theorems 3.1 and 4.1.

A.1. Proof of Theorem 3.1

We first show

Lemma A.1. The quantized minimax risk is lower bounded by $V_\varepsilon(m, c, B_\varepsilon)$, the value of the optimization $(\mathcal{P}_1)$.

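Proof. As will be clear to the reader, $V_\varepsilon(m, c, B_\varepsilon)$ is achieved by some $\sigma^2$ that is non-increasing and finitely supported. Let $\sigma^2$ be such that

$$\sigma_1^2 \ge \cdots \ge \sigma_n^2 > 0 = \sigma_{n+1}^2 = \cdots, \qquad \sum_{j=1}^{n} a_j^2\sigma_j^2 = \frac{c^2}{\pi^{2m}},$$

and let

$$\Theta_n(m, c) = \Big\{\theta \in \ell^2 : \sum_{j=1}^{n} a_j^2\theta_j^2 \le \frac{c^2}{\pi^{2m}},\ \theta_j = 0 \text{ for } j \ge n+1\Big\} \subset \Theta(m, c).$$

We build on this sequence $\sigma^2$ a prior distribution for $\theta$. In particular, for $\tau \in (0, 1)$, write $s_j^2 = (1 - \tau)\sigma_j^2$ and let $\pi_\tau(\theta; \sigma^2)$ be the prior distribution on $\theta$ such that

$$\theta_j \sim \mathcal{N}(0, s_j^2), \quad j = 1, \dots, n, \qquad \mathbb{P}(\theta_j = 0) = 1, \quad j \ge n+1.$$

We observe that

$$
\begin{aligned}
R_\varepsilon(m, c, B_\varepsilon) &\ge \inf_{\hat\theta,\, C(\hat\theta) \le B_\varepsilon}\ \sup_{\theta \in \Theta_n(m,c)} \mathbb{E}\|\theta - \hat\theta\|^2 \\
&\ge \inf_{\hat\theta,\, C(\hat\theta) \le B_\varepsilon} \int_{\Theta_n(m,c)} \mathbb{E}\|\theta - \hat\theta\|^2\, d\pi_\tau(\theta; \sigma^2) \\
&\ge I_\tau - r_\tau,
\end{aligned}
$$

where $I_\tau$ is the integrated risk of the optimal quantized estimator

$$I_\tau = \inf_{\hat\theta,\, C(\hat\theta) \le B_\varepsilon} \int_{\mathbb{R}^n \otimes \{0\}^\infty} \mathbb{E}\|\theta - \hat\theta\|^2\, d\pi_\tau(\theta; \sigma^2)$$

and $r_\tau$ is the residual

$$r_\tau = \sup_{\hat\theta \in \Theta_n(m,c)} \int_{\overline{\Theta}_n(m,c)} \mathbb{E}\|\theta - \hat\theta\|^2\, d\pi_\tau(\theta; \sigma^2),$$

where $\overline{\Theta}_n(m, c) = (\mathbb{R}^n \otimes \{0\}^\infty) \setminus \Theta_n(m, c)$. As shown in Section 3, $\lim_{\tau \to 0} I_\tau$ is lower bounded by the value of the optimization

$$
\min_{\tau^2}\ \sum_{j=1}^{n} \tau_j^2 + \sum_{j=1}^{n} \frac{\sigma_j^2\varepsilon^2}{\sigma_j^2 + \varepsilon^2}
\quad \text{such that } \sum_{j=1}^{\infty} \frac{1}{2}\log_+\frac{\sigma_j^4}{\tau_j^2(\sigma_j^2 + \varepsilon^2)} \le B_\varepsilon.
$$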


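It then suffices to show that $r_\tau = o(I_\tau)$ as $\varepsilon \to 0$ for $\tau \in (0, 1)$. Let $d_n = \sup_{\theta \in \Theta_n(m,c)} \|\theta\|$, which is bounded since for any $\theta \in \Theta_n(m, c)$

$$\|\theta\| = \sqrt{\sum_{j=1}^{n} \theta_j^2} \le \sqrt{\frac{1}{a_1^2}\sum_{j=1}^{n} a_j^2\theta_j^2} \le \sqrt{\frac{c^2}{a_1^2\pi^{2m}}} = \frac{c}{a_1\pi^m}.$$

We have

$$
\begin{aligned}
r_\tau &= \sup_{\hat\theta \in \Theta_n(m,c)} \int_{\overline{\Theta}_n(m,c)} \mathbb{E}\|\theta - \hat\theta\|^2\, d\pi_\tau(\theta; \sigma^2) \\
&\le 2\int_{\overline{\Theta}_n(m,c)} (d_n^2 + \mathbb{E}\|\theta\|^2)\, d\pi_\tau(\theta; \sigma^2) \\
&\le 2\Big(d_n^2\, \mathbb{P}(\theta \notin \Theta_n(m,c)) + \big(\mathbb{P}(\theta \notin \Theta_n(m,c))\, \mathbb{E}\|\theta\|^4\big)^{1/2}\Big),
\end{aligned}
$$

where we use the Cauchy-Schwarz inequality. Noticing that

$$
\begin{aligned}
\mathbb{E}\|\theta\|^4 = \mathbb{E}\Big(\sum_{j=1}^{n} \theta_j^2\Big)^2
&= \sum_{j_1 \ne j_2} \mathbb{E}(\theta_{j_1}^2)\mathbb{E}(\theta_{j_2}^2) + \sum_{j=1}^{n} \mathbb{E}(\theta_j^4) \\
&\le \sum_{j_1 \ne j_2} s_{j_1}^2 s_{j_2}^2 + 3\sum_{j=1}^{n} s_j^4
\le 3\Big(\sum_{j=1}^{n} s_j^2\Big)^2 \le 3d_n^4,
\end{aligned}
$$

we obtain

$$
\begin{aligned}
r_\tau &\le 2d_n^2\Big(\mathbb{P}(\theta \notin \Theta_n(m,c)) + \sqrt{3\,\mathbb{P}(\theta \notin \Theta_n(m,c))}\Big) \\
&\le 6d_n^2\sqrt{\mathbb{P}(\theta \notin \Theta_n(m,c))}.
\end{aligned}
$$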


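Thus, we only need to show that $\sqrt{\mathbb{P}(\theta \notin \Theta_n(m,c))} = o(I_\tau)$. In fact,

$$
\begin{aligned}
\mathbb{P}(\theta \notin \Theta_n(m,c))
&= \mathbb{P}\Big(\sum_{j=1}^{n} a_j^2\theta_j^2 > \frac{c^2}{\pi^{2m}}\Big) \\
&= \mathbb{P}\Big(\sum_{j=1}^{n} a_j^2(\theta_j^2 - \mathbb{E}(\theta_j^2)) > \frac{c^2}{\pi^{2m}} - (1 - \tau)\sum_{j=1}^{n} a_j^2\sigma_j^2\Big) \\
&= \mathbb{P}\Big(\sum_{j=1}^{n} a_j^2(\theta_j^2 - \mathbb{E}(\theta_j^2)) > \frac{\tau c^2}{\pi^{2m}}\Big) \\
&= \mathbb{P}\Big(\sum_{j=1}^{n} a_j^2 s_j^2 (Z_j^2 - 1) > \frac{\tau}{1 - \tau}\sum_{j=1}^{n} a_j^2 s_j^2\Big),
\end{aligned}
$$

where $Z_j \sim \mathcal{N}(0, 1)$. By Lemma A.2, we get

$$
\mathbb{P}(\theta \notin \Theta_n(m,c)) \le \exp\Big(-\frac{\tau^2\sum_{j=1}^{n} a_j^2 s_j^2}{8(1 - \tau)^2\max_{1 \le j \le n} a_j^2 s_j^2}\Big)
= \exp\Big(-\frac{\tau^2\sum_{j=1}^{n} a_j^2\sigma_j^2}{8(1 - \tau)^2\max_{1 \le j \le n} a_j^2\sigma_j^2}\Big).
$$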


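Next we will show that for the $\sigma^2$ that achieves $V_\varepsilon(m, c, B_\varepsilon)$, we have $\sqrt{\mathbb{P}(\theta \notin \Theta_n(m,c))} = o(I_\tau)$.

For the over-sufficient regime where $B_\varepsilon\varepsilon^{\frac{2}{2m+1}} \to \infty$ as $\varepsilon \to 0$, it is shown in [22] that $\max_j a_j^2\sigma_j^2 = O(\varepsilon^{\frac{2}{2m+1}})$ and $I_\tau = O(\varepsilon^{\frac{4m}{2m+1}})$, and hence that $\sqrt{\mathbb{P}(\theta \notin \Theta_n(m,c))} = o(I_\tau)$. For the insufficient regime where $B_\varepsilon\varepsilon^{\frac{2}{2m+1}} \to 0$ but still $B_\varepsilon \to \infty$ as $\varepsilon \to 0$, an achieving sequence $\sigma^2$ is given later by (A.4) and (A.3). We obtain that $\max_{1 \le j \le n} a_j^2\sigma_j^2 = O(B_\varepsilon^{-1})$ and $I_\tau = O(B_\varepsilon^{-2m})$, and therefore $\sqrt{\mathbb{P}(\theta \notin \Theta_n(m,c))} = o(I_\tau)$. The sufficient regime where $B_\varepsilon\varepsilon^{\frac{2}{2m+1}} \to d$ for some constant $d$ is a bit more complicated, as we do not have an explicit formula for the optimal sequence $\sigma^2$. However, by Lemma 3.5, for the continuous approximation $\sigma^2(x)$ such that $\sigma_j^2 = \sigma^2(jh)h^{2m+1}$, we have

$$\lambda x^{2m}\sigma^2(x) = \frac{\sigma^2(x)}{(\sigma^2(x) + 1)^2} + \alpha\,\frac{\sigma^2(x) + 2}{\sigma^2(x) + 1} \le \frac{1}{4} + 2\alpha,$$

where $\alpha = \exp\Big(\frac{1}{x_0}\int_0^{x_0}\log\frac{\sigma^4(x)}{\sigma^2(x)+1}\,dx - \frac{2d}{x_0}\Big)$ and $\lambda$ are both constants. Therefore,

$$\max_{1 \le j \le n} a_j^2\sigma_j^2 \asymp \max_{1 \le j \le n} j^{2m}\sigma^2(jh)h^{2m+1} \le \frac{1}{\lambda}\Big(\frac{1}{4} + 2\alpha\Big)\cdot h.$$

Note that $a_j^2\sigma_j^2 = O(h^{2m})$ and that $h = \varepsilon^{\frac{2}{2m+1}}$. We obtain that for this case $I_\tau = O(\varepsilon^{\frac{4m}{2m+1}})$ and $\sqrt{\mathbb{P}(\theta \notin \Theta_n(m,c))} = o(I_\tau)$. Thus, for each of the three regimes, we have $r_\tau = o(I_\tau)$. □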

Lemma A.2 (Lemma 3.5 in [22]). Suppose that $X_1, \dots, X_n$ are i.i.d. $\mathcal{N}(0, 1)$. For $t \in (0, 1)$ and $\omega_j > 0$, $j = 1, \dots, n$, we have

$$\mathbb{P}\Big(\sum_{j=1}^{n} \omega_j(X_j^2 - 1) > t\sum_{j=1}^{n} \omega_j\Big) \le \exp\Big(-\frac{t^2\sum_{j=1}^{n} \omega_j}{8\max_{1 \le j \le n} \omega_j}\Big).$$

Proof of Lemma 3.2. This is in fact Pinsker’s theorem, which gives the exact asymptotic minimax risk of estimation of normal means in the Sobolev ellipsoid. The proof can be found in [18] and [22]. □

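Proof of Lemma 3.3. As argued in Section 3 for the lower bound in the sufficient regime, optimization problem $(\mathcal{Q}_1)$ can be reformulated as

$$
\begin{aligned}
\max_{\sigma^2,\,J}\quad & J\eta \\
\text{such that}\quad & \frac{1}{2}\sum_{j=1}^{J}\log_+\frac{\sigma_j^4}{\eta(\sigma_j^2 + \varepsilon^2)} \le B_\varepsilon, \\
& \sum_{j=1}^{J} a_j^2\sigma_j^2 \le \frac{c^2}{\pi^{2m}}, \\
& (\sigma_j^2)\ \text{is decreasing and}\ \frac{\sigma_J^4}{\sigma_J^2 + \varepsilon^2} \ge \eta.
\end{aligned}
\qquad(\mathcal{Q}_2)
$$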


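Now suppose that we have a series $(\sigma_j^2)$ which satisfies the last constraint and is supported on $\{1, \dots, J\}$. By the first constraint, we have that

$$
\begin{aligned}
J\eta &\le J\exp\Big(-\frac{2B_\varepsilon}{J}\Big)\Big(\prod_{j=1}^{J}\frac{\sigma_j^4}{\sigma_j^2 + \varepsilon^2}\Big)^{\frac{1}{J}} \\
&\le J\exp\Big(-\frac{2B_\varepsilon}{J}\Big)\Big(\prod_{j=1}^{J}\sigma_j^2\Big)^{\frac{1}{J}} \\
&\le \exp\Big(-\frac{2B_\varepsilon}{J}\Big)(J!)^{-\frac{2m}{J}}\sum_{j=1}^{J} a_j^2\sigma_j^2 \\
&\le \frac{c^2}{\pi^{2m}}\exp\Big(-\frac{2B_\varepsilon}{J}\Big)(J!)^{-\frac{2m}{J}},
\end{aligned}
\qquad (A.1)
$$

where the third step is the inequality of arithmetic and geometric means applied to $a_j^2\sigma_j^2$ with $a_j = j^m$. This provides a series of upper bounds for $Q_\varepsilon(m, c, B_\varepsilon)$ parameterized by $J$. To maximize (A.1) over $J$, we look at the ratio of the neighboring terms with $J$ and $J + 1$, and compare it to 1. We obtain that the optimal $J$ satisfies

$$\frac{J^J}{J!} \le \exp\Big(\frac{B_\varepsilon}{m}\Big) < \frac{(J+1)^{J+1}}{(J+1)!}. \qquad (A.2)$$

Denote this optimal $J$ by $J_\varepsilon$. By Stirling’s approximation, we have

$$\lim_{\varepsilon \to 0} \frac{mJ_\varepsilon}{B_\varepsilon} = 1, \qquad (A.3)$$

and plugging this asymptote into (A.1), we get as $\varepsilon \to 0$

$$\frac{c^2}{\pi^{2m}}\exp\Big(-\frac{2B_\varepsilon}{J_\varepsilon}\Big)(J_\varepsilon!)^{-\frac{2m}{J_\varepsilon}} \sim \frac{c^2}{\pi^{2m}}J_\varepsilon^{-2m} \sim \frac{c^2 m^{2m}}{\pi^{2m}}B_\varepsilon^{-2m}.$$

This gives the desired upper bound (3.6).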

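Next we show that the upper bound (3.6) is asymptotically achievable when $B_\varepsilon\varepsilon^{\frac{2}{2m+1}} \to 0$ and $B_\varepsilon \to \infty$. It suffices to find a feasible solution that attains (3.6). Let

$$\tilde\sigma_j^2 = \frac{c^2/\pi^{2m}}{J_\varepsilon a_j^2}, \qquad j = 1, \dots, J_\varepsilon. \qquad (A.4)$$

Note that the entire sequence $(\tilde\sigma_j^2)_{j=1}^{J_\varepsilon}$ does not qualify as a feasible solution, since the first constraint in $(\mathcal{Q}_2)$ will not be satisfied for any $\eta \le \frac{\tilde\sigma_{J_\varepsilon}^4}{\tilde\sigma_{J_\varepsilon}^2 + \varepsilon^2}$. We keep only the first $J'_\varepsilon$ terms of $(\tilde\sigma_j^2)$, where $J'_\varepsilon$ is the largest $j$ such that

$$\frac{\tilde\sigma_j^4}{\tilde\sigma_j^2 + \varepsilon^2} \ge \tilde\sigma_{J_\varepsilon}^2. \qquad (A.5)$$

Thus,

$$
\sum_{j=1}^{J'_\varepsilon}\frac{1}{2}\log\frac{\tilde\sigma_j^4}{\tilde\sigma_{J_\varepsilon}^2(\tilde\sigma_j^2 + \varepsilon^2)}
\le \sum_{j=1}^{J'_\varepsilon}\frac{1}{2}\log\frac{\tilde\sigma_j^2}{\tilde\sigma_{J_\varepsilon}^2}
\le \sum_{j=1}^{J_\varepsilon}\frac{1}{2}\log\frac{\tilde\sigma_j^2}{\tilde\sigma_{J_\varepsilon}^2}
\le B_\varepsilon,
$$

where the last inequality is due to (A.2). This tells us that setting $\eta = \tilde\sigma_{J_\varepsilon}^2$ leads to a feasible solution to $(\mathcal{Q}_2)$. As a result,

$$Q_\varepsilon(m, c, B_\varepsilon) \ge J'_\varepsilon\tilde\sigma_{J_\varepsilon}^2. \qquad (A.6)$$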

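If we can show that $J'_\varepsilon \sim J_\varepsilon$, then

$$J'_\varepsilon\tilde\sigma_{J_\varepsilon}^2 \sim J_\varepsilon\tilde\sigma_{J_\varepsilon}^2 \sim \frac{c^2 m^{2m}}{\pi^{2m}}B_\varepsilon^{-2m}. \qquad (A.7)$$

To show that $J'_\varepsilon \sim J_\varepsilon$, it suffices to show that $\tilde\sigma_{J'_\varepsilon}^2 \sim \tilde\sigma_{J_\varepsilon}^2$. Plugging the formula of $\tilde\sigma_j^2$ into (A.5) and solving for $\tilde\sigma_{J'_\varepsilon}^2$, we get

$$\tilde\sigma_{J'_\varepsilon}^2 \sim \frac{\tilde\sigma_{J_\varepsilon}^2 + \sqrt{\tilde\sigma_{J_\varepsilon}^4 + 4\varepsilon^2\tilde\sigma_{J_\varepsilon}^2}}{2} \sim \tilde\sigma_{J_\varepsilon}^2,$$

where the equivalence is due to the assumption $B_\varepsilon\varepsilon^{\frac{2}{2m+1}} \to 0$ and a Taylor expansion of the function $\sqrt{x}$. □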

Proof of Lemma 3.5. Suppose that $\sigma^2(x)$ with $x_0$ solves $(\mathcal{P}_4)$. Consider a function $\sigma^2(x) + \xi v(x)$ such that it is still feasible for $(\mathcal{P}_4)$, and thus we have

$$\int_0^{x_0} x^{2m}v(x)\,dx \le 0.$$

Now plugging $\sigma^2(x) + \xi v(x)$ for $\sigma^2(x)$ in the objective function of $(\mathcal{P}_4)$, taking the derivative with respect to $\xi$, and letting $\xi \to 0$, we must have

$$
\int_0^{x_0}\frac{v(x)}{(\sigma^2(x) + 1)^2}\,dx
+ x_0\exp\Big(\frac{1}{x_0}\int_0^{x_0}\log\frac{\sigma^4(x)}{\sigma^2(x) + 1}\,dx - \frac{2d}{x_0}\Big)
\frac{1}{x_0}\int_0^{x_0}\Big(\frac{2v(x)}{\sigma^2(x)} - \frac{v(x)}{\sigma^2(x) + 1}\Big)dx \le 0,
$$

which, after some calculation and rearrangement of terms, yields

$$
\int_0^{x_0} v(x)\Big[\frac{1}{(\sigma^2(x) + 1)^2}
+ \exp\Big(\frac{1}{x_0}\int_0^{x_0}\log\frac{\sigma^4(y)}{\sigma^2(y) + 1}\,dy - \frac{2d}{x_0}\Big)
\frac{\sigma^2(x) + 2}{\sigma^2(x)(\sigma^2(x) + 1)}\Big]dx \le 0.
$$

Thus, by the lemma that follows, we obtain that for some $\lambda$

$$
\frac{1}{(\sigma^2(x) + 1)^2}
+ \exp\Big(\frac{1}{x_0}\int_0^{x_0}\log\frac{\sigma^4(y)}{\sigma^2(y) + 1}\,dy - \frac{2d}{x_0}\Big)
\frac{\sigma^2(x) + 2}{\sigma^2(x)(\sigma^2(x) + 1)} = \lambda x^{2m}.
$$

□

Lemma A.3. Suppose that $f(x)$ and $g(x)$ are two non-zero functions on $(0, x_0)$ such that for any $v(x)$ satisfying $\int_0^{x_0} f(x)v(x)\,dx \le 0$, it holds that $\int_0^{x_0} g(x)v(x)\,dx \le 0$. Then there exists a constant $\lambda$ such that $f(x) = \lambda g(x)$.

Proof. First we show that for any $v(x)$ such that $\int_0^{x_0} f(x)v(x)\,dx = 0$ we must have $\int_0^{x_0} g(x)v(x)\,dx = 0$. Otherwise, suppose that $v_0(x)$ is such that $\int_0^{x_0} f(x)v_0(x)\,dx = 0$ and $\int_0^{x_0} g(x)v_0(x)\,dx < 0$. Then take another $v(x)$ with $\int_0^{x_0} f(x)v(x)\,dx < 0$ and consider $v_\gamma(x) = v(x) - \gamma v_0(x)$. We have $\int_0^{x_0} f(x)v_\gamma(x)\,dx < 0$ and $\int_0^{x_0} g(x)v_\gamma(x)\,dx = \int_0^{x_0} v(x)g(x)\,dx - \gamma\int_0^{x_0} g(x)v_0(x)\,dx > 0$ for large enough $\gamma$, which results in a contradiction.

Let $\lambda = \int_0^{x_0} f(x)^2\,dx \big/ \int_0^{x_0} f(x)g(x)\,dx$; the denominator cannot be zero. In fact, if $\int_0^{x_0} f(x)g(x)\,dx = 0$, it would imply that $\int_0^{x_0} g(x)^2\,dx = 0$ and hence $g(x) = 0$. Now consider the function $f(x) - \lambda g(x)$. Notice that we have $\int_0^{x_0} f(x)(f(x) - \lambda g(x))\,dx = 0$ by the definition of $\lambda$. It follows that $\int_0^{x_0} g(x)(f(x) - \lambda g(x))\,dx = 0$, and therefore $\int_0^{x_0} (f(x) - \lambda g(x))^2\,dx = 0$, which concludes the proof. □


A.2. Proof of Theorem 4.1

Now we give the details of the proof of Theorem 4.1. For the purpose of our analysis, we define two allocations of bits, the monotone allocation and the blockwise constant allocation,

$$\Pi_{\mathrm{blk}}(B) = \Big\{(b_j)_{j=1}^{N} : \sum_{j=1}^{N} b_j \le B,\ b_j = \bar b_k \text{ for } j \in J_k,\ 0 \le b_j \le b_{\max}\Big\}, \qquad (A.8)$$

$$\Pi_{\mathrm{mon}}(B) = \Big\{(b_j)_{j=1}^{N} : \sum_{j=1}^{N} b_j \le B,\ b_{j-1} \ge b_j,\ 0 \le b_j \le b_{\max}\Big\}, \qquad (A.9)$$

where $b_{\max} = 2\log(1/\varepsilon)$. We also define two classes of weights, the monotonic weights and the blockwise constant weights,

$$\Omega_{\mathrm{blk}} = \big\{(\omega_j)_{j=1}^{N} : \omega_j = \bar\omega_k \text{ for } j \in J_k,\ 0 \le \omega_j \le 1\big\}, \qquad (A.10)$$

$$\Omega_{\mathrm{mon}} = \big\{(\omega_j)_{j=1}^{N} : \omega_{j-1} \ge \omega_j,\ 0 \le \omega_j \le 1\big\}. \qquad (A.11)$$

We will also need the following results from [22] regarding the weakly geometric system of blocks.

Lemma A.4. Let $\{J_k\}$ be a weakly geometric block system defined by (4.1). Then there exist $0 < \varepsilon_0 < 1$ and $C > 0$ such that for any $\varepsilon \in (0, \varepsilon_0)$,

$$K \le C\log^2(1/\varepsilon), \qquad \max_{1 \le k \le K-1} \frac{T_{k+1}}{T_k} \le 1 + 3\rho_\varepsilon.$$

We divide the proof into four steps.


Step 1. Truncation and replacement 


The loss of the quantized estimator $\check\theta$ can be decomposed into

$$\|\check\theta - \theta\|^2 = \sum_{k=1}^{K} \|\check\theta_{(k)} - \theta_{(k)}\|^2 + \sum_{j=N+1}^{\infty} \theta_j^2,$$

where the remainder term satisfies

$$\sum_{j=N+1}^{\infty} \theta_j^2 \le N^{-2m} \sum_{j=N+1}^{\infty} a_j^2\theta_j^2 = O(N^{-2m}).$$

If we assume that $m > 1/2$, which corresponds to classes of continuous functions, the remainder term is then $o(\varepsilon^2)$. If $m \le 1/2$, the remainder term is on the order of $O(\varepsilon^{4m})$, which is still negligible compared to the order of the lower bound $\varepsilon^{4m/(2m+1)}$. To ease the notation, we will assume that $m > 1/2$ and write the remainder term as $o(\varepsilon^2)$, bearing in mind that the proof works for all $m > 0$. We can thus discard the remainder term in our analysis. Recall that the quantized estimate for each block is given by

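$$\check\theta_{(k)} = \frac{\check S_k^2 - T_k\varepsilon^2}{\check S_k}\sqrt{1 - 2^{-2\check b_k}}\;\check Z_{(k)},$$

and consider the following estimate with $\check S_k$ replaced by $S_k$:

$$\hat\theta_{(k)} = \frac{S_k^2 - T_k\varepsilon^2}{S_k}\sqrt{1 - 2^{-2\check b_k}}\;\check Z_{(k)}.$$

Notice that

$$\|\hat\theta_{(k)} - \check\theta_{(k)}\| = \left| \frac{S_k^2 - T_k\varepsilon^2}{S_k} - \frac{\check S_k^2 - T_k\varepsilon^2}{\check S_k} \right| \sqrt{1 - 2^{-2\check b_k}}\,\|\check Z_{(k)}\| \le \frac{S_k\check S_k + T_k\varepsilon^2}{S_k\check S_k}\,\big|S_k - \check S_k\big| \le 2\varepsilon^2,$$

where the last inequality is because $S_k\check S_k \ge T_k\varepsilon^2$ and $|S_k - \check S_k| \le \varepsilon^2$. Thus we can safely replace $\check\theta_{(k)}$ by $\hat\theta_{(k)}$ because

$$\begin{aligned}
\|\check\theta_{(k)} - \theta_{(k)}\|^2 &= \|\check\theta_{(k)} - \hat\theta_{(k)} + \hat\theta_{(k)} - \theta_{(k)}\|^2 \\
&\le \|\check\theta_{(k)} - \hat\theta_{(k)}\|^2 + \|\hat\theta_{(k)} - \theta_{(k)}\|^2 + 2\|\check\theta_{(k)} - \hat\theta_{(k)}\|\,\|\hat\theta_{(k)} - \theta_{(k)}\| \\
&= \|\hat\theta_{(k)} - \theta_{(k)}\|^2 + O(\varepsilon^2).
\end{aligned}$$

Therefore, we have

$$\mathbb{E}\|\check\theta - \theta\|^2 = \mathbb{E}\sum_{k=1}^{K} \|\hat\theta_{(k)} - \theta_{(k)}\|^2 + O(K\varepsilon^2).$$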
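Step 2. Expectation over codebooks


Now conditioning on the data $Y$, we work under the probability measure introduced by the random codebook. Write

$$\lambda_k = \frac{S_k^2 - T_k\varepsilon^2}{S_k^2} \quad\text{and}\quad Z_{(k)} = \frac{Y_{(k)}}{\|Y_{(k)}\|}.$$

We decompose and examine the following term:

$$\begin{aligned}
A_k &= \|\hat\theta_{(k)} - \theta_{(k)}\|^2 \\
&= \|\hat\theta_{(k)} - \lambda_k S_k Z_{(k)} + \lambda_k S_k Z_{(k)} - \theta_{(k)}\|^2 \\
&= \underbrace{\|\hat\theta_{(k)} - \lambda_k S_k Z_{(k)}\|^2}_{A_{k,1}} + \underbrace{\|\lambda_k S_k Z_{(k)} - \theta_{(k)}\|^2}_{A_{k,2}} + \underbrace{2\big\langle \hat\theta_{(k)} - \lambda_k S_k Z_{(k)},\ \lambda_k S_k Z_{(k)} - \theta_{(k)} \big\rangle}_{A_{k,3}}.
\end{aligned}$$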

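To bound the expectation of the first term $A_{k,1}$, we need the following lemma, which bounds the probability of the distortion of a codeword exceeding the desired value.

Lemma A.5. Suppose that $Z_1, \ldots, Z_n$ are independent and each follows the uniform distribution on the $t$-dimensional unit sphere $\mathbb{S}^{t-1}$. Let $y \in \mathbb{S}^{t-1}$ be a fixed vector, and

$$Z^* = \operatorname*{arg\,min}_{z \in Z_{1:n}} \big\|\sqrt{1 - 2^{-2q}}\,z - y\big\|^2.$$

If $n = 2^{qt}$, then

$$\mathbb{E}\big\|\sqrt{1 - 2^{-2q}}\,Z^* - y\big\|^2 \le 2^{-2q}(1 + \nu(t)) + 2e^{-2t},$$

where

$$\nu(t) = \frac{6\log t + 7}{t - 6\log t - 7}.$$

Observe that

$$A_{k,1} = \|\hat\theta_{(k)} - \lambda_k S_k Z_{(k)}\|^2 = \big\|\lambda_k S_k\sqrt{1 - 2^{-2\check b_k}}\,\check Z_{(k)} - \lambda_k S_k Z_{(k)}\big\|^2 = \lambda_k^2 S_k^2\,\big\|\sqrt{1 - 2^{-2\check b_k}}\,\check Z_{(k)} - Z_{(k)}\big\|^2.$$

Then, it follows as a result of Lemma A.5 that

$$\begin{aligned}
\mathbb{E}\big(A_{k,1} \mid Y_{(k)}\big) &\le \frac{(S_k^2 - T_k\varepsilon^2)^2}{S_k^2}\Big(2^{-2\check b_k}(1 + \nu_\varepsilon) + 2e^{-2T_k}\Big) \\
&\le \frac{(S_k^2 - T_k\varepsilon^2)^2}{S_k^2}\Big(2^{-2\check b_k}(1 + \nu_\varepsilon) + 2e^{-2T_1}\Big) \\
&\le \frac{(S_k^2 - T_k\varepsilon^2)^2}{S_k^2}\,2^{-2\check b_k}(1 + \nu_\varepsilon) + 2c^2\varepsilon^2,
\end{aligned}$$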
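To get a feel for the size of the bound in Lemma A.5, the right-hand side can be evaluated numerically. The sketch below is illustrative only: the function names `nu` and `lemma_a5_bound` are hypothetical, and the logarithm in $\nu(t)$ is assumed to be the natural logarithm.

```python
import math

def nu(t):
    """nu(t) = (6 log t + 7) / (t - 6 log t - 7); natural log is an assumption."""
    denom = t - 6.0 * math.log(t) - 7.0
    if denom <= 0:
        # The expression is only a meaningful (positive) slack for large enough t.
        raise ValueError("nu(t) requires t > 6 log t + 7")
    return (6.0 * math.log(t) + 7.0) / denom

def lemma_a5_bound(q, t):
    """Right-hand side of Lemma A.5: 2^{-2q} (1 + nu(t)) + 2 exp(-2t)."""
    return 2.0 ** (-2.0 * q) * (1.0 + nu(t)) + 2.0 * math.exp(-2.0 * t)

if __name__ == "__main__":
    for t in (50, 100, 1000):
        print(t, nu(t), lemma_a5_bound(1.0, t))
```

The slack $\nu(t)$ shrinks as the block length $t$ grows, so for long blocks the bound approaches the ideal distortion $2^{-2q}$.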
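where $\nu_\varepsilon = \nu(T_1) = \frac{6\log T_1 + 7}{T_1 - 6\log T_1 - 7}$. Since $A_{k,2}$ only depends on $Y_{(k)}$, $\mathbb{E}(A_{k,2} \mid Y_{(k)}) = A_{k,2}$. Next we consider the cross term $A_{k,3}$. Write $\gamma_k = \frac{\langle \theta_{(k)}, Y_{(k)} \rangle}{\|Y_{(k)}\|^2}$ and

$$\begin{aligned}
A_{k,3} &= 2\big\langle \hat\theta_{(k)} - \lambda_k S_k Z_{(k)},\ \lambda_k S_k Z_{(k)} - \theta_{(k)} \big\rangle \\
&= \underbrace{2\big\langle \hat\theta_{(k)} - \lambda_k S_k Z_{(k)},\ \gamma_k Y_{(k)} - \theta_{(k)} \big\rangle}_{A_{k,3a}} + \underbrace{2\big\langle \hat\theta_{(k)} - \lambda_k S_k Z_{(k)},\ \lambda_k S_k Z_{(k)} - \gamma_k Y_{(k)} \big\rangle}_{A_{k,3b}}.
\end{aligned}$$

The quantity $\gamma_k$ is chosen such that $\langle Y_{(k)}, \gamma_k Y_{(k)} - \theta_{(k)} \rangle = 0$ and therefore

$$A_{k,3a} = 2\big\langle \hat\theta_{(k)} - \lambda_k S_k Z_{(k)},\ \gamma_k Y_{(k)} - \theta_{(k)} \big\rangle = 2\big\langle \Pi_{Y_{(k)}^{\perp}}\big(\hat\theta_{(k)} - \lambda_k S_k Z_{(k)}\big),\ \gamma_k Y_{(k)} - \theta_{(k)} \big\rangle,$$

where $\Pi_{Y_{(k)}^{\perp}}$ denotes the projection onto the orthogonal complement of $Y_{(k)}$. Due to the choice of $\check Z_{(k)}$, the projection $\Pi_{Y_{(k)}^{\perp}}(\hat\theta_{(k)} - \lambda_k S_k Z_{(k)})$ is rotation symmetric and hence $\mathbb{E}(A_{k,3a} \mid Y_{(k)}) = 0$. Finally, for $A_{k,3b}$ we have

$$\begin{aligned}
\mathbb{E}\big(A_{k,3b} \mid Y_{(k)}\big) &\le 2\|\lambda_k S_k Z_{(k)} - \gamma_k Y_{(k)}\|\;\mathbb{E}\big(\|\hat\theta_{(k)} - \lambda_k S_k Z_{(k)}\| \,\big|\, Y_{(k)}\big) \\
&\le 2\|\lambda_k S_k Z_{(k)} - \gamma_k Y_{(k)}\|\sqrt{\mathbb{E}\big(\|\hat\theta_{(k)} - \lambda_k S_k Z_{(k)}\|^2 \,\big|\, Y_{(k)}\big)} \\
&\le 2\|\lambda_k S_k Z_{(k)} - \gamma_k Y_{(k)}\|\sqrt{\frac{(S_k^2 - T_k\varepsilon^2)^2}{S_k^2}\,2^{-2\check b_k}(1 + \nu_\varepsilon) + O(\varepsilon^2)}.
\end{aligned}$$

Combining all the analyses above, we have

$$\begin{aligned}
\mathbb{E}\big(A_k \mid Y_{(k)}\big) \le{}& \frac{(S_k^2 - T_k\varepsilon^2)^2}{S_k^2}\,2^{-2\check b_k}(1 + \nu_\varepsilon) + 2c^2\varepsilon^2 + \|\lambda_k S_k Z_{(k)} - \theta_{(k)}\|^2 \\
&+ 2\|\lambda_k S_k Z_{(k)} - \gamma_k Y_{(k)}\|\sqrt{\frac{(S_k^2 - T_k\varepsilon^2)^2}{S_k^2}\,2^{-2\check b_k}(1 + \nu_\varepsilon) + O(\varepsilon^2)},
\end{aligned}$$

and summing over $k$ we get

$$\begin{aligned}
\mathbb{E}\big(\|\hat\theta - \theta\|^2 \mid Y\big) \le{}& \sum_{k=1}^{K} \frac{(S_k^2 - T_k\varepsilon^2)^2}{S_k^2}\,2^{-2\check b_k}(1 + \nu_\varepsilon) + \sum_{k=1}^{K} \|\lambda_k S_k Z_{(k)} - \theta_{(k)}\|^2 \\
&+ 2\sum_{k=1}^{K} \|\lambda_k S_k Z_{(k)} - \gamma_k Y_{(k)}\|\sqrt{\frac{(S_k^2 - T_k\varepsilon^2)^2}{S_k^2}\,2^{-2\check b_k}(1 + \nu_\varepsilon) + O(\varepsilon^2)} + O(K\varepsilon^2).
\end{aligned} \tag{A.12}$$

Step 3. Expectation over data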


First we will state three lemmas, which bound the deviation of the expectation of some particular 
functions of the norm of a Gaussian vector to the desired quantities. The proofs are given in Section 
A.3. 

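Lemma A.6. Suppose that $X_i \sim N(\theta_i, \sigma^2)$ independently for $i = 1, \ldots, n$, where $\|\theta\|^2 \le c^2$. Let $S$ be given by

$$S = \begin{cases} \sqrt{n\sigma^2} & \text{if } \|X\| < \sqrt{n\sigma^2}, \\ \sqrt{n\sigma^2} + c & \text{if } \|X\| > \sqrt{n\sigma^2} + c, \\ \|X\| & \text{otherwise.} \end{cases}$$

Then there exists some absolute constant $C_0$ such that

$$\mathbb{E}\left( \frac{S^2 - n\sigma^2}{S} - \frac{\langle \theta, X \rangle}{\|X\|} \right)^2 \le C_0\sigma^2.$$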
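The quantity $S$ in Lemma A.6 is simply the norm $\|X\|$ clipped to the interval $[\sqrt{n\sigma^2}, \sqrt{n\sigma^2} + c]$. A minimal sketch of that definition follows; the function name `clipped_norm` is a hypothetical label chosen here for illustration.

```python
import math

def clipped_norm(x, sigma, c):
    """Clip ||x|| to [sqrt(n) * sigma, sqrt(n) * sigma + c], as in the definition of S."""
    n = len(x)
    lower = math.sqrt(n) * sigma      # sqrt(n sigma^2)
    upper = lower + c                 # sqrt(n sigma^2) + c
    norm = math.sqrt(sum(v * v for v in x))
    return min(max(norm, lower), upper)

if __name__ == "__main__":
    print(clipped_norm([0.0, 0.0, 0.0, 0.0], sigma=1.0, c=2.0))  # norm below the lower end
    print(clipped_norm([3.0, 0.0, 0.0, 0.0], sigma=1.0, c=2.0))  # norm inside the interval
    print(clipped_norm([3.0, 4.0, 0.0, 0.0], sigma=1.0, c=2.0))  # norm above the upper end
```

Clipping guarantees $S^2 \ge n\sigma^2$, so the shrinkage factor $(S^2 - n\sigma^2)/S$ is never negative, and $S$ stays within a bounded range that can be quantized on a finite grid.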


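Lemma A.7. Let $X$ and $S$ be the same as defined in Lemma A.6. Then for $n > 4$

$$\mathbb{E}\,\frac{(S^2 - n\sigma^2)^2}{S^2} \le \frac{\|\theta\|^4}{\|\theta\|^2 + n\sigma^2} + \frac{4n}{n-4}\,\sigma^2.$$

Lemma A.8. Let $X$ and $S$ be the same as defined in Lemma A.6. Define

$$\hat\theta_+ = \left( \frac{\|X\|^2 - n\sigma^2}{\|X\|^2} \right)_+ X, \qquad \hat\theta_\dagger = \frac{S^2 - n\sigma^2}{S\,\|X\|}\,X.$$

Then

$$\mathbb{E}\|\hat\theta_\dagger - \theta\|^2 \le \mathbb{E}\|\hat\theta_+ - \theta\|^2 \le \frac{n\sigma^2\|\theta\|^2}{\|\theta\|^2 + n\sigma^2} + 4\sigma^2.$$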

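We now take the expectation with respect to the data on both sides of (A.12). First, by the Cauchy-Schwarz inequality

$$\begin{aligned}
&\mathbb{E}\left( \|\lambda_k S_k Z_{(k)} - \gamma_k Y_{(k)}\|\sqrt{\frac{(S_k^2 - T_k\varepsilon^2)^2}{S_k^2}\,2^{-2\check b_k}(1 + \nu_\varepsilon) + O(\varepsilon^2)} \right) \\
&\qquad \le \sqrt{\mathbb{E}\|\lambda_k S_k Z_{(k)} - \gamma_k Y_{(k)}\|^2}\;\sqrt{\mathbb{E}\left( \frac{(S_k^2 - T_k\varepsilon^2)^2}{S_k^2}\,2^{-2\check b_k}(1 + \nu_\varepsilon) + O(\varepsilon^2) \right)}.
\end{aligned} \tag{A.13}$$

We then calculate

$$\begin{aligned}
\sqrt{\mathbb{E}\|\lambda_k S_k Z_{(k)} - \gamma_k Y_{(k)}\|^2} &= \sqrt{\mathbb{E}\left\| \frac{S_k^2 - T_k\varepsilon^2}{S_k}\,\frac{Y_{(k)}}{\|Y_{(k)}\|} - \frac{\langle \theta_{(k)}, Y_{(k)} \rangle}{\|Y_{(k)}\|}\,\frac{Y_{(k)}}{\|Y_{(k)}\|} \right\|^2} \\
&= \sqrt{\mathbb{E}\left( \frac{S_k^2 - T_k\varepsilon^2}{S_k} - \frac{\langle \theta_{(k)}, Y_{(k)} \rangle}{\|Y_{(k)}\|} \right)^2} \\
&\le C_0\varepsilon,
\end{aligned}$$

where the last inequality is due to Lemma A.6, and $C_0$ is the constant therein. Plugging this in (A.13) and summing over $k$, we get

$$\begin{aligned}
&\mathbb{E}\sum_{k=1}^{K} \|\lambda_k S_k Z_{(k)} - \gamma_k Y_{(k)}\|\sqrt{\frac{(S_k^2 - T_k\varepsilon^2)^2}{S_k^2}\,2^{-2\check b_k}(1 + \nu_\varepsilon) + O(\varepsilon^2)} \\
&\qquad \le C_0\varepsilon\sum_{k=1}^{K} \sqrt{\mathbb{E}\left( \frac{(S_k^2 - T_k\varepsilon^2)^2}{S_k^2}\,2^{-2\check b_k}(1 + \nu_\varepsilon) + O(\varepsilon^2) \right)} \\
&\qquad \le C_0\sqrt{K}\,\varepsilon\,\sqrt{\mathbb{E}\sum_{k=1}^{K} \frac{(S_k^2 - T_k\varepsilon^2)^2}{S_k^2}\,2^{-2\check b_k}(1 + \nu_\varepsilon) + O(K\varepsilon^2)}.
\end{aligned}$$

Therefore,

$$\begin{aligned}
\mathbb{E}\|\hat\theta - \theta\|^2 \le{}& \underbrace{\mathbb{E}\sum_{k=1}^{K} \frac{(S_k^2 - T_k\varepsilon^2)^2}{S_k^2}\,2^{-2\check b_k}(1 + \nu_\varepsilon)}_{B_1} + \underbrace{\mathbb{E}\sum_{k=1}^{K} \|\lambda_k S_k Z_{(k)} - \theta_{(k)}\|^2}_{B_2} \\
&+ C_0\sqrt{K}\,\varepsilon\,\sqrt{\mathbb{E}\sum_{k=1}^{K} \frac{(S_k^2 - T_k\varepsilon^2)^2}{S_k^2}\,2^{-2\check b_k}(1 + \nu_\varepsilon) + O(K\varepsilon^2)} + O(K\varepsilon^2).
\end{aligned}$$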


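Now we deal with the term $B_1$. Recall that the sequence $\check b$ solves problem (4.2), so for any sequence $b \in \Pi_{\mathrm{blk}}(B)$

$$\sum_{k=1}^{K} \frac{(\check S_k^2 - T_k\varepsilon^2)^2}{\check S_k^2}\,2^{-2\check b_k} \le \sum_{k=1}^{K} \frac{(\check S_k^2 - T_k\varepsilon^2)^2}{\check S_k^2}\,2^{-2\bar b_k}.$$

Notice that

$$\left| \frac{(S_k^2 - T_k\varepsilon^2)^2}{S_k^2} - \frac{(\check S_k^2 - T_k\varepsilon^2)^2}{\check S_k^2} \right| = \big|S_k^2 - \check S_k^2\big|\,\frac{\big|S_k^2\check S_k^2 - T_k^2\varepsilon^4\big|}{S_k^2\check S_k^2} = O(\varepsilon^2),$$

and thus,

$$\sum_{k=1}^{K} \frac{(S_k^2 - T_k\varepsilon^2)^2}{S_k^2}\,2^{-2\check b_k} \le \sum_{k=1}^{K} \frac{(S_k^2 - T_k\varepsilon^2)^2}{S_k^2}\,2^{-2\bar b_k} + O(K\varepsilon^2).$$

Taking the expectation, we get

$$\mathbb{E}\sum_{k=1}^{K} \frac{(S_k^2 - T_k\varepsilon^2)^2}{S_k^2}\,2^{-2\check b_k} \le \sum_{k=1}^{K} \mathbb{E}\left( \frac{(S_k^2 - T_k\varepsilon^2)^2}{S_k^2} \right) 2^{-2\bar b_k} + O(K\varepsilon^2).$$

Applying Lemma A.7, we get for $T_k > 4$

$$\mathbb{E}\,\frac{(S_k^2 - T_k\varepsilon^2)^2}{S_k^2} \le \frac{\|\theta_{(k)}\|^4}{\|\theta_{(k)}\|^2 + T_k\varepsilon^2} + \frac{4T_k}{T_k - 4}\,\varepsilon^2,$$

and it follows that

$$\mathbb{E}\sum_{k=1}^{K} \frac{(S_k^2 - T_k\varepsilon^2)^2}{S_k^2}\,2^{-2\check b_k} \le \sum_{k=1}^{K} \frac{\|\theta_{(k)}\|^4}{\|\theta_{(k)}\|^2 + T_k\varepsilon^2}\,2^{-2\bar b_k} + O(K\varepsilon^2).$$

Since $b \in \Pi_{\mathrm{blk}}(B)$ is arbitrary,

$$\mathbb{E}\sum_{k=1}^{K} \frac{(S_k^2 - T_k\varepsilon^2)^2}{S_k^2}\,2^{-2\check b_k} \le \min_{b \in \Pi_{\mathrm{blk}}(B)} \sum_{k=1}^{K} \frac{\|\theta_{(k)}\|^4}{\|\theta_{(k)}\|^2 + T_k\varepsilon^2}\,2^{-2\bar b_k} + O(K\varepsilon^2).$$

Turning to the term $B_2$, as a result of Lemma A.8 we have

$$\mathbb{E}\|\lambda_k S_k Z_{(k)} - \theta_{(k)}\|^2 \le \frac{\|\theta_{(k)}\|^2\,T_k\varepsilon^2}{\|\theta_{(k)}\|^2 + T_k\varepsilon^2} + 4\varepsilon^2.$$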

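Combining the above results, we have shown that

$$\mathbb{E}\|\check\theta - \theta\|^2 \le M + O(K\varepsilon^2) + C_0\sqrt{K}\,\varepsilon\,\sqrt{M + O(K\varepsilon^2)}, \tag{A.14}$$

where

$$\begin{aligned}
M &= (1 + \nu_\varepsilon)\min_{b \in \Pi_{\mathrm{blk}}(B)} \sum_{k=1}^{K} \frac{\|\theta_{(k)}\|^4}{\|\theta_{(k)}\|^2 + T_k\varepsilon^2}\,2^{-2\bar b_k} + \sum_{k=1}^{K} \frac{\|\theta_{(k)}\|^2\,T_k\varepsilon^2}{\|\theta_{(k)}\|^2 + T_k\varepsilon^2} \\
&= (1 + \nu_\varepsilon)\min_{b \in \Pi_{\mathrm{blk}}(B)} \sum_{k=1}^{K} \frac{\|\theta_{(k)}\|^4}{\|\theta_{(k)}\|^2 + T_k\varepsilon^2}\,2^{-2\bar b_k} + \min_{\omega \in \Omega_{\mathrm{blk}}} \sum_{k=1}^{K} \Big( (1 - \bar\omega_k)^2\|\theta_{(k)}\|^2 + \bar\omega_k^2\,T_k\varepsilon^2 \Big).
\end{aligned}$$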
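The second expression for $M$ uses the fact that, block by block, the linear risk $(1 - \omega)^2 a + \omega^2 b$ is minimized at $\omega = a/(a+b)$ with minimum value $ab/(a+b)$, here with $a = \|\theta_{(k)}\|^2$ and $b = T_k\varepsilon^2$. The following sketch checks this numerically; `linear_risk` and `oracle_weight` are hypothetical helper names.

```python
def linear_risk(omega, a, b):
    """Risk (1 - omega)^2 * a + omega^2 * b of a linear shrinkage weight omega."""
    return (1.0 - omega) ** 2 * a + omega ** 2 * b

def oracle_weight(a, b):
    """Minimizer a / (a + b) of the linear risk over omega in [0, 1]."""
    return a / (a + b)

if __name__ == "__main__":
    a, b = 3.0, 1.0                      # a = ||theta_(k)||^2, b = T_k * eps^2
    w = oracle_weight(a, b)
    grid_min = min(linear_risk(i / 1000.0, a, b) for i in range(1001))
    print(w, linear_risk(w, a, b), a * b / (a + b), grid_min)
```

Setting the derivative $-2(1-\omega)a + 2\omega b$ to zero gives the stated minimizer, which always lies in $[0, 1]$, so the box constraint in $\Omega_{\mathrm{blk}}$ is inactive.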


Step 4. Blockwise constant is almost optimal 


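We now show that in terms of both bit allocation and weight assignment, blockwise constant is almost optimal. Let's first consider bit allocation. Let $B' = \frac{1}{1 + 3\rho_\varepsilon}(B - T_1 b_{\max})$. We are going to show that

$$\min_{b \in \Pi_{\mathrm{blk}}(B)} \sum_{k=1}^{K} \frac{\|\theta_{(k)}\|^4}{\|\theta_{(k)}\|^2 + T_k\varepsilon^2}\,2^{-2\bar b_k} \le \min_{b \in \Pi_{\mathrm{mon}}(B')} \sum_{j=1}^{N} \frac{\theta_j^4}{\theta_j^2 + \varepsilon^2}\,2^{-2b_j}. \tag{A.15}$$

In fact, suppose that $b^* \in \Pi_{\mathrm{mon}}(B')$ achieves the minimum on the right hand side, and define $\bar b^*$ by

$$\bar b^*_j = \begin{cases} \max_{i \in B_k} b^*_i & j \in B_k, \\ 0 & j > N. \end{cases}$$

The sum of the elements in $\bar b^*$ then satisfies

$$\begin{aligned}
\sum_{j=1}^{\infty} \bar b^*_j &= \sum_{k=0}^{K-1} T_{k+1}\max_{j \in B_{k+1}} b^*_j \\
&= T_1 b^*_1 + \sum_{k=1}^{K-1} T_{k+1}\max_{j \in B_{k+1}} b^*_j \\
&\le T_1 b_{\max} + \sum_{k=1}^{K-1} \frac{T_{k+1}}{T_k}\sum_{j \in B_k} b^*_j \\
&\le T_1 b_{\max} + (1 + 3\rho_\varepsilon)\sum_{k=1}^{K-1}\sum_{j \in B_k} b^*_j \\
&\le T_1 b_{\max} + (1 + 3\rho_\varepsilon)B' \\
&= B,
\end{aligned}$$

which means that $\bar b^* \in \Pi_{\mathrm{blk}}(B)$. It then follows that

$$\begin{aligned}
\min_{b \in \Pi_{\mathrm{blk}}(B)} \sum_{k=1}^{K} \frac{\|\theta_{(k)}\|^4}{\|\theta_{(k)}\|^2 + T_k\varepsilon^2}\,2^{-2\bar b_k} &\le \sum_{k=1}^{K} \frac{\|\theta_{(k)}\|^4}{\|\theta_{(k)}\|^2 + T_k\varepsilon^2}\,2^{-2\bar b^*_k} \\
&\le \sum_{k=1}^{K}\sum_{j \in B_k} \frac{\theta_j^4}{\theta_j^2 + \varepsilon^2}\,2^{-2\bar b^*_k} \qquad \text{(A.16)} \\
&\le \sum_{j=1}^{N} \frac{\theta_j^4}{\theta_j^2 + \varepsilon^2}\,2^{-2b^*_j} \\
&= \min_{b \in \Pi_{\mathrm{mon}}(B')} \sum_{j=1}^{N} \frac{\theta_j^4}{\theta_j^2 + \varepsilon^2}\,2^{-2b_j},
\end{aligned}$$

where (A.16) is due to Jensen's inequality on the convex function $x \mapsto \frac{x^2}{x + \varepsilon^2}$:

$$\frac{\big(\frac{1}{T_k}\|\theta_{(k)}\|^2\big)^2}{\frac{1}{T_k}\|\theta_{(k)}\|^2 + \varepsilon^2} \le \frac{1}{T_k}\sum_{j \in B_k} \frac{\theta_j^4}{\theta_j^2 + \varepsilon^2}.$$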
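The construction of the blockwise constant allocation $\bar b^*$ from a monotone allocation $b^*$, and the budget bound it satisfies, can be illustrated on a toy example. This is a sketch under stated assumptions: the block sizes and bit values below are made up for illustration, `blockwise_max` and `budget_bound` are hypothetical helper names, and $b^*_1$ stands in for $b_{\max}$ in the first-block term.

```python
def blockwise_max(b, block_sizes):
    """Replace each entry of b by the maximum over its block (the sequence b-bar-star)."""
    out, start = [], 0
    for size in block_sizes:
        block = b[start:start + size]
        out.extend([max(block)] * size)
        start += size
    return out

def budget_bound(b, block_sizes):
    """T_1 * b_1 + sum_{k=1}^{K-1} (T_{k+1} / T_k) * sum_{j in B_k} b_j for monotone b."""
    total = block_sizes[0] * b[0]
    start = 0
    for k in range(len(block_sizes) - 1):
        size = block_sizes[k]
        total += block_sizes[k + 1] / size * sum(b[start:start + size])
        start += size
    return total

if __name__ == "__main__":
    b_star = [5, 4, 3, 2, 1, 1, 0]       # non-increasing allocation (illustrative values)
    sizes = [2, 2, 3]                    # block lengths T_1, T_2, T_3 (illustrative values)
    b_bar = blockwise_max(b_star, sizes)
    print(b_bar, sum(b_bar), budget_bound(b_star, sizes))
```

Because $b^*$ is non-increasing, the maximum over block $k+1$ is at most the average over block $k$, which is what drives the third line of the budget calculation.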

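Next, for the weights assignment, by Lemma 3.11 in [22], we have

$$\min_{\omega \in \Omega_{\mathrm{blk}}} \sum_{k=1}^{K} \Big( (1 - \bar\omega_k)^2\|\theta_{(k)}\|^2 + \bar\omega_k^2\,T_k\varepsilon^2 \Big) \le (1 + 3\rho_\varepsilon)\left( \min_{\omega \in \Omega_{\mathrm{mon}}} \sum_{j=1}^{N} \Big( (1 - \omega_j)^2\theta_j^2 + \omega_j^2\varepsilon^2 \Big) \right) + T_1\varepsilon^2. \tag{A.17}$$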
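Combining (A.15) and (A.17), we get

$$\begin{aligned}
M ={}& (1 + \nu_\varepsilon)\min_{b \in \Pi_{\mathrm{blk}}(B)} \sum_{k=1}^{K} \frac{\|\theta_{(k)}\|^4}{\|\theta_{(k)}\|^2 + T_k\varepsilon^2}\,2^{-2\bar b_k} + \min_{\omega \in \Omega_{\mathrm{blk}}} \sum_{k=1}^{K} \Big( (1 - \bar\omega_k)^2\|\theta_{(k)}\|^2 + \bar\omega_k^2\,T_k\varepsilon^2 \Big) \\
\le{}& (1 + \nu_\varepsilon)\min_{b \in \Pi_{\mathrm{blk}}(B)} \sum_{k=1}^{K} \frac{\|\theta_{(k)}\|^4}{\|\theta_{(k)}\|^2 + T_k\varepsilon^2}\,2^{-2\bar b_k} + (1 + 3\rho_\varepsilon)\min_{\omega \in \Omega_{\mathrm{mon}}} \sum_{j=1}^{N} \Big( (1 - \omega_j)^2\theta_j^2 + \omega_j^2\varepsilon^2 \Big) + T_1\varepsilon^2 \\
\le{}& (1 + \nu_\varepsilon)\left( \min_{b \in \Pi_{\mathrm{mon}}(B')} \sum_{j=1}^{N} \frac{\theta_j^4}{\theta_j^2 + \varepsilon^2}\,2^{-2b_j} + \min_{\omega \in \Omega_{\mathrm{mon}}} \sum_{j=1}^{N} \Big( (1 - \omega_j)^2\theta_j^2 + \omega_j^2\varepsilon^2 \Big) \right) + T_1\varepsilon^2.
\end{aligned}$$

Then by Lemma A.9,

$$M \le (1 + \nu_\varepsilon)V_\varepsilon(m, c, B') + T_1\varepsilon^2,$$

which, plugged into (A.14), gives us

$$\mathbb{E}\|\check\theta - \theta\|^2 \le (1 + \nu_\varepsilon)V_\varepsilon(m, c, B') + O(K\varepsilon^2) + C_0\sqrt{K}\,\varepsilon\,\sqrt{(1 + \nu_\varepsilon)V_\varepsilon(m, c, B') + O(K\varepsilon^2)}.$$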


Recall that 


and that 




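$$\lim_{\varepsilon \to 0} \frac{B'}{B} = \lim_{\varepsilon \to 0} \frac{1}{1 + 3\rho_\varepsilon}\left( 1 - \frac{T_1 b_{\max}}{B} \right) = 1.$$

Thus,

$$\lim_{\varepsilon \to 0} \frac{V_\varepsilon(m, c, B')}{V_\varepsilon(m, c, B)} = 1.$$

Also notice that no matter how $B$ grows as $\varepsilon \to 0$, $V_\varepsilon(m, c, B)$ is bounded below by a constant multiple of $\varepsilon^{4m/(2m+1)}$, so that $K\varepsilon^2 / V_\varepsilon(m, c, B) \to 0$. Therefore,

$$\begin{aligned}
\lim_{\varepsilon \to 0} \frac{\mathbb{E}\|\check\theta - \theta\|^2}{V_\varepsilon(B, m, c)} \le \lim_{\varepsilon \to 0} \Bigg( & (1 + \nu_\varepsilon)\frac{V_\varepsilon(B', m, c)}{V_\varepsilon(B, m, c)} + \frac{O(K\varepsilon^2)}{V_\varepsilon(B, m, c)} \\
& + C_0\sqrt{(1 + \nu_\varepsilon)\frac{V_\varepsilon(B', m, c)}{V_\varepsilon(B, m, c)}\cdot\frac{K\varepsilon^2}{V_\varepsilon(B, m, c)} + O\left( \left( \frac{K\varepsilon^2}{V_\varepsilon(B, m, c)} \right)^2 \right)} \Bigg) = 1,
\end{aligned}$$

which concludes the proof.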

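Lemma A.9. Let $V_1$ be the value of the optimization

$$\begin{aligned}
&\max_{\theta}\ \min_{b}\ \sum_{j=1}^{N} \left( \frac{\theta_j^4}{\theta_j^2 + \varepsilon^2}\,2^{-2b_j} + \frac{\theta_j^2\varepsilon^2}{\theta_j^2 + \varepsilon^2} \right) \\
&\text{such that } \sum_{j=1}^{N} b_j \le B,\quad b_j \ge 0,\quad \sum_{j=1}^{N} a_j^2\theta_j^2 \le \frac{c^2}{\pi^{2m}},
\end{aligned} \tag{$A_1$}$$

and let $V_2$ be the value of the optimization

$$\begin{aligned}
&\max_{\theta}\ \min_{b,\,\omega}\ \sum_{j=1}^{N} \left( \frac{\theta_j^4}{\theta_j^2 + \varepsilon^2}\,2^{-2b_j} + (1 - \omega_j)^2\theta_j^2 + \omega_j^2\varepsilon^2 \right) \\
&\text{such that } \sum_{j=1}^{N} b_j \le B,\quad b_{j-1} \ge b_j,\quad 0 \le b_j \le b_{\max},\quad \omega_{j-1} \ge \omega_j,\quad \sum_{j=1}^{N} a_j^2\theta_j^2 \le \frac{c^2}{\pi^{2m}}.
\end{aligned} \tag{$A_2$}$$

Then $V_1 = V_2$.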


A.3. Proofs of Lemmas 

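Proof of Lemma A.5. Let $\zeta(t)$ be a positive function of $t$ to be specified later. Let

$$p_0 = \mathbb{P}\Big( \big\|\sqrt{1 - 2^{-2q}}\,Z_1 - y\big\| \le 2^{-q}\sqrt{1 + \zeta(t)} \Big).$$

By Lemma A.10, when $\zeta(t) \le 2(1 - 2^{-2q})$, $p_0$ can be lower bounded by

$$p_0 \ge \frac{\Gamma(\frac{t}{2} + 1)}{\sqrt{\pi}\,t\,\Gamma(\frac{t+1}{2})}\Big( 2^{-q}\sqrt{1 + \zeta(t)/2} \Big)^{t-1}.$$

We obtain that

$$\begin{aligned}
\mathbb{E}\big\|\sqrt{1 - 2^{-2q}}\,Z^* - y\big\|^2 &\le 2^{-2q}(1 + \zeta(t)) + 2\,\mathbb{P}\Big( \big\|\sqrt{1 - 2^{-2q}}\,Z^* - y\big\| > 2^{-q}\sqrt{1 + \zeta(t)} \Big) \\
&= 2^{-2q}(1 + \zeta(t)) + 2(1 - p_0)^n.
\end{aligned}$$

To upper bound $(1 - p_0)^n$, we consider

$$\begin{aligned}
\log\big((1 - p_0)^n\big) = n\log(1 - p_0) &\le -np_0 \\
&\le -2^{qt}\,\frac{\Gamma(\frac{t}{2} + 1)}{\sqrt{\pi}\,t\,\Gamma(\frac{t+1}{2})}\Big( 2^{-q}\sqrt{1 + \zeta(t)/2} \Big)^{t-1} \\
&\le -\frac{\Gamma(\frac{t}{2} + 1)}{\sqrt{\pi}\,t\,\Gamma(\frac{t+1}{2})}\,(1 + \zeta(t)/2)^{\frac{t-1}{2}} \\
&\le -\frac{\sqrt{2\pi}\,(\frac{t}{2})^{\frac{t}{2}+\frac{1}{2}}\,e^{-\frac{t}{2}}}{\sqrt{\pi}\,t\,e\,(\frac{t-1}{2})^{\frac{t}{2}}\,e^{-\frac{t-1}{2}}}\;e^{\frac{t-1}{2(2/\zeta(t)+1)}} \\
&= -e^{-\frac{3}{2}}\,t^{-\frac{1}{2}}\left( \frac{t}{t-1} \right)^{\frac{t}{2}} e^{\frac{t-1}{2(2/\zeta(t)+1)}} \\
&\le -e^{-1}\,t^{-\frac{1}{2}}\,e^{\frac{t-1}{2(2/\zeta(t)+1)}},
\end{aligned}$$

where we have used $\log(1 + x) \ge x/(1 + x)$ and Stirling's approximation in the form

$$\sqrt{2\pi}\,z^{z+1/2}e^{-z} \le \Gamma(z + 1) \le e\,z^{z+1/2}e^{-z}.$$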
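The two-sided Stirling bound used above can be checked numerically on a log scale with the standard library's log-gamma function. The sketch below is only a sanity check at a few sample points with $z > 1$, not a proof; the helper name `stirling_bounds_hold` is hypothetical.

```python
import math

def stirling_bounds_hold(z):
    """Check sqrt(2 pi) z^(z+1/2) e^(-z) <= Gamma(z+1) <= e z^(z+1/2) e^(-z) in log form."""
    log_gamma = math.lgamma(z + 1.0)
    core = (z + 0.5) * math.log(z) - z          # log of z^(z+1/2) e^(-z)
    lower = 0.5 * math.log(2.0 * math.pi) + core
    upper = 1.0 + core                          # log(e) = 1
    return lower <= log_gamma <= upper

if __name__ == "__main__":
    for z in (1.5, 2.0, 10.0, 100.5):
        print(z, stirling_bounds_hold(z))
```

In the proof the bound is applied with $z = t/2$ in the numerator and $z = (t-1)/2$ in the denominator.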


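In order for $(1 - p_0)^n \le e^{-2t}$ to hold, we need

$$-2t = -e^{-1}\,t^{-\frac{1}{2}}\,e^{\frac{t-1}{2(2/\zeta(t)+1)}},$$

which leads to the choice of $\zeta(t)$

$$\zeta(t) = \frac{2}{\frac{t-1}{2\log(2e\,t^{3/2})} - 1} = \frac{6\log t + 4\log(2e)}{t - 3\log t - 2\log(2e) - 1}.$$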
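As a quick numerical sanity check of this choice, one can plug $\zeta(t)$ back into the exponent bound and confirm that it equals $2t$. The sketch below assumes natural logarithms; the function names `zeta` and `exponent_bound` are hypothetical.

```python
import math

def zeta(t):
    """zeta(t) = (6 log t + 4 log(2e)) / (t - 3 log t - 2 log(2e) - 1)."""
    num = 6.0 * math.log(t) + 4.0 * math.log(2.0 * math.e)
    den = t - 3.0 * math.log(t) - 2.0 * math.log(2.0 * math.e) - 1.0
    if den <= 0:
        raise ValueError("t is too small for zeta(t) to be positive")
    return num / den

def exponent_bound(t):
    """e^{-1} t^{-1/2} exp((t - 1) / (2 (2 / zeta(t) + 1))), which should equal 2t."""
    z = zeta(t)
    return math.exp(-1.0) * t ** -0.5 * math.exp((t - 1.0) / (2.0 * (2.0 / z + 1.0)))

if __name__ == "__main__":
    for t in (50.0, 200.0):
        print(t, zeta(t), exponent_bound(t), 2.0 * t)
```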


Thus, we have shown that when $q$ is not too close to 0, satisfying $1 - 2^{-2q} \ge \zeta(t)/2$, we have

$$\mathbb{E}\big\|\sqrt{1 - 2^{-2q}}\,Z^* - y\big\|^2 \le 2^{-2q}(1 + \zeta(t)) + 2e^{-2t}.$$

When $1 - 2^{-2q} < \zeta(t)/2$, we observe that

$$\mathbb{E}\big\|\sqrt{1 - 2^{-2q}}\,Z^* - y\big\|^2 = 1 - 2^{-2q} + 1 - 2\sqrt{1 - 2^{-2q}}\;\mathbb{E}\langle Z^*, y \rangle \le 2 - 2^{-2q} = 2^{-2q}\big( 1 + 2(2^{2q} - 1) \big)$$

and that

$$2(2^{2q} - 1) < \frac{2}{1 - \zeta(t)/2} - 2 = \frac{2\zeta(t)}{2 - \zeta(t)} = \frac{6\log t + 4\log(2e)}{t - 6\log t - 4\log(2e) - 1}.$$

Now take $\nu(t) = \frac{6\log t + 7}{t - 6\log t - 7}$. Notice that $\nu(t) > \frac{6\log t + 4\log(2e)}{t - 6\log t - 4\log(2e) - 1} > \zeta(t)$. We have for any $q \ge 0$

$$\mathbb{E}\big\|\sqrt{1 - 2^{-2q}}\,Z^* - y\big\|^2 \le 2^{-2q}(1 + \nu(t)) + 2e^{-2t}.$$

Fig 5. Illustration of the geometry for calculating $p_0$


Lemma A.10. Suppose $Z$ is a $t$-dimensional random vector uniformly distributed on the unit sphere $\mathbb{S}^{t-1}$. Let $y$ be a fixed vector on the unit sphere. For $\delta < 1$ and $\zeta > 0$ satisfying $\zeta \le 2(1 - \delta^2)$, define

$$p_0 = \mathbb{P}\Big( \big\|\sqrt{1 - \delta^2}\,Z - y\big\| \le \delta\sqrt{1 + \zeta} \Big).$$

We have

$$p_0 \ge \frac{\Gamma(\frac{t}{2} + 1)}{\sqrt{\pi}\,t\,\Gamma(\frac{t+1}{2})}\Big( \delta\sqrt{1 + \zeta/2} \Big)^{t-1}.$$

Proof. The proof is based on an idea from [21]. Denote by $V_t$ and $A_t$ the volume and the surface area of a $t$-dimensional unit sphere, respectively. We have

$$V_t = \int_0^1 A_t r^{t-1}\,dr = \frac{1}{t}A_t.$$

From the geometry of the situation as illustrated in Figure 5, $p_0$ is equal to the ratio of two areas $S_1$ and $S_2$. The first area $S_1$ is the portion of the surface area of the sphere of radius $\sqrt{1 - \delta^2}$ and center $O$ contained within the sphere of radius $\delta\sqrt{1 + \zeta}$ and center $y$. It is the surface area of a $(t-1)$-dimensional polar cap of radius $\sqrt{1 - \delta^2}$ and polar angle $\theta_0$, and can be lower bounded by the area of a $(t-1)$-dimensional disk of radius $\sqrt{1 - \delta^2}\sin\theta_0$, that is,

$$S_1 \ge V_{t-1}\big( \sqrt{1 - \delta^2}\sin\theta_0 \big)^{t-1} = \frac{1}{t-1}A_{t-1}\big( \sqrt{1 - \delta^2}\sin\theta_0 \big)^{t-1}.$$

The second area $S_2$ is simply the surface area of a $(t-1)$-dimensional sphere of radius $\sqrt{1 - \delta^2}$,

$$S_2 = A_t\big( \sqrt{1 - \delta^2} \big)^{t-1}.$$

Therefore, we obtain

$$p_0 = \frac{S_1}{S_2} \ge \frac{\frac{1}{t-1}A_{t-1}\big( \sqrt{1 - \delta^2}\sin\theta_0 \big)^{t-1}}{A_t\big( \sqrt{1 - \delta^2} \big)^{t-1}} = \frac{\Gamma(\frac{t}{2} + 1)}{\sqrt{\pi}\,t\,\Gamma(\frac{t+1}{2})}\,(\sin\theta_0)^{t-1},$$

where we have used the well-known relationship between $A_{t-1}$ and $A_t$

$$\frac{A_{t-1}}{A_t} = \frac{1}{\sqrt{\pi}}\,\frac{(t-1)\,\Gamma(\frac{t}{2} + 1)}{t\,\Gamma(\frac{t+1}{2})}.$$

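Now we need to calculate $\sin\theta_0$. By the law of cosines, we have

$$\cos\theta_0 = \frac{1 + 1 - \delta^2 - \delta^2(1 + \zeta)}{2\sqrt{1 - \delta^2}} = \frac{1 - \delta^2(1 + \zeta/2)}{\sqrt{1 - \delta^2}},$$

and it follows that

$$\sin^2\theta_0 = 1 - \cos^2\theta_0 = 1 - \frac{1 + \delta^4(1 + \zeta/2)^2 - 2\delta^2(1 + \zeta/2)}{1 - \delta^2} = \delta^2(1 + \zeta) - \frac{\delta^4\zeta^2}{4(1 - \delta^2)}.$$

Now since $\zeta \le 2(1 - \delta^2)$, we get

$$\sin\theta_0 \ge \delta\sqrt{1 + \zeta/2},$$

which completes the proof. □

Proof of Lemma A.6. We first claim that

$$\mathbb{E}\left( \frac{S^2 - n\sigma^2}{S} - \frac{\langle \theta, X \rangle}{\|X\|} \right)^2 \le \mathbb{E}\left( \frac{\|X\|^2 - n\sigma^2}{\|X\|} - \frac{\langle \theta, X \rangle}{\|X\|} \right)^2.$$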

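In fact, writing $\mathbb{E}_r(\cdot)$ for the conditional expectation $\mathbb{E}(\cdot \mid \|X\| = r)$, it suffices to show that for $r < \sqrt{n\sigma^2}$ and $r > \sqrt{n\sigma^2} + c$

$$\mathbb{E}_r\left( \frac{S^2 - n\sigma^2}{S} - \frac{\langle \theta, X \rangle}{\|X\|} \right)^2 \le \mathbb{E}_r\left( \frac{\|X\|^2 - n\sigma^2}{\|X\|} - \frac{\langle \theta, X \rangle}{\|X\|} \right)^2.$$

When $r < \sqrt{n\sigma^2}$, it is equivalent to

$$\mathbb{E}_r\left( \frac{\langle \theta, X \rangle}{\|X\|} \right)^2 \le \mathbb{E}_r\left( \frac{\|X\|^2 - n\sigma^2}{\|X\|} - \frac{\langle \theta, X \rangle}{\|X\|} \right)^2.$$

It is then sufficient to show that $\mathbb{E}_r\langle \theta, X \rangle \ge 0$. This can be obtained by following a similar argument as in Lemma A.6 in [22]. When $r > \sqrt{n\sigma^2} + c$, we need to show that

$$\mathbb{E}_r\left( \frac{(\sqrt{n\sigma^2} + c)^2 - n\sigma^2}{\sqrt{n\sigma^2} + c} - \frac{\langle \theta, X \rangle}{\|X\|} \right)^2 \le \mathbb{E}_r\left( \frac{\|X\|^2 - n\sigma^2}{\|X\|} - \frac{\langle \theta, X \rangle}{\|X\|} \right)^2,$$

which, after some algebra, boils down to

$$\frac{(\sqrt{n\sigma^2} + c)^2 - n\sigma^2}{\sqrt{n\sigma^2} + c} + \frac{r^2 - n\sigma^2}{r} \ge \frac{2}{r}\,\mathbb{E}_r\langle \theta, X \rangle.$$

This holds because

$$\begin{aligned}
\frac{(\sqrt{n\sigma^2} + c)^2 - n\sigma^2}{\sqrt{n\sigma^2} + c}\,r + r^2 - n\sigma^2 - 2\,\mathbb{E}_r\langle \theta, X \rangle &\ge \|\theta\|^2 + r^2 - n\sigma^2 - 2\,\mathbb{E}_r\langle \theta, X \rangle \\
&= \mathbb{E}_r\|X - \theta\|^2 - n\sigma^2 \\
&\ge 0,
\end{aligned}$$

where we have used the assumption that $r > \sqrt{n\sigma^2} + c$, $\|\theta\| \le c$ and that

$$\mathbb{E}_r\|X - \theta\| \ge \mathbb{E}_r\|X\| - \|\theta\| \ge \sqrt{n\sigma^2}.$$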


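Now that we have shown the claimed inequality and noting that

$$\mathbb{E}\left( \frac{\|X\|^2 - n\sigma^2}{\|X\|} - \frac{\langle \theta, X \rangle}{\|X\|} \right)^2 = \sigma^2\,\mathbb{E}\left( \frac{\|X/\sigma\|^2 - n}{\|X/\sigma\|} - \frac{\langle \theta/\sigma, X/\sigma \rangle}{\|X/\sigma\|} \right)^2,$$

we can assume that $X \sim N(\theta, I_n)$ and equivalently show that there exists a universal constant $C_0$ such that

$$\mathbb{E}\left( \frac{\|X\|^2 - n}{\|X\|} - \frac{\langle \theta, X \rangle}{\|X\|} \right)^2 \le C_0$$

holds for any $n$ and $\theta$. Letting $Z = X - \theta$ and writing $\|\theta\|^2 = \xi$, we have

$$\begin{aligned}
\mathbb{E}\left( \frac{\|X\|^2 - n}{\|X\|} - \frac{\langle \theta, X \rangle}{\|X\|} \right)^2 &= \mathbb{E}\left( \frac{\|Z + \theta\|^2 - n - \xi}{\|Z + \theta\|} - \frac{\langle \theta, Z \rangle}{\|Z + \theta\|} \right)^2 \\
&\le 2\,\mathbb{E}\left( \frac{\|Z + \theta\|^2 - n - \xi}{\|Z + \theta\|} \right)^2 + 2\,\mathbb{E}\left( \frac{\langle \theta, Z \rangle}{\|Z + \theta\|} \right)^2 \\
&= 2\left( -(n + \xi) + (n + \xi)^2\,\mathbb{E}\,\frac{1}{\|Z + \theta\|^2} \right) + 2\,\mathbb{E}\left( \frac{\langle \theta, Z \rangle}{\|Z + \theta\|} \right)^2 \\
&\le \frac{8(n + \xi)}{n + \xi - 4} + 2\,\mathbb{E}\left( \frac{\langle \theta, Z \rangle}{\|Z + \theta\|} \right)^2,
\end{aligned}$$

where the last inequality is due to Lemma A.11. To bound the last term, we apply the Cauchy-Schwarz inequality and get

$$\mathbb{E}\left( \frac{\langle \theta, Z \rangle}{\|Z + \theta\|} \right)^2 \le \sqrt{\mathbb{E}\,\frac{1}{\|Z + \theta\|^4}}\;\sqrt{\mathbb{E}\langle \theta, Z \rangle^4} \le \sqrt{\frac{3(n-4)\,\xi^2}{(n-6)(n + \xi - 4)(n + \xi - 6)}},$$

where the last inequality is again due to Lemma A.11, together with $\mathbb{E}\langle \theta, Z \rangle^4 = 3\xi^2$. Thus we just need to take $C_0$ to be

$$\sup_{n \ge 7,\ \xi \ge 0}\ \frac{8(n + \xi)}{n + \xi - 4} + 2\sqrt{\frac{3(n-4)\,\xi^2}{(n-6)(n + \xi - 4)(n + \xi - 6)}},$$

which is apparently a finite quantity. □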

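Proof of Lemma A.7. Since the function $(x^2 - n\sigma^2)^2/x^2$ is decreasing on $(0, \sqrt{n\sigma^2})$ and increasing on $(\sqrt{n\sigma^2}, \infty)$, we have

$$\frac{(S^2 - n\sigma^2)^2}{S^2} \le \frac{(\|X\|^2 - n\sigma^2)^2}{\|X\|^2},$$

and it follows that if $n > 4$

$$\begin{aligned}
\mathbb{E}\,\frac{(S^2 - n\sigma^2)^2}{S^2} &\le \mathbb{E}\,\frac{(\|X\|^2 - n\sigma^2)^2}{\|X\|^2} && \text{(A.18)} \\
&= \mathbb{E}\|X\|^2 - 2n\sigma^2 + n^2\sigma^4\,\mathbb{E}\left( \frac{1}{\|X\|^2} \right) && \text{(A.19)} \\
&\le \|\theta\|^2 - n\sigma^2 + \frac{n^2\sigma^4}{\|\theta\|^2 + n\sigma^2 - 4\sigma^2} && \text{(A.20)} \\
&\le \frac{\|\theta\|^4}{\|\theta\|^2 + n\sigma^2} + \frac{4n}{n-4}\,\sigma^2, && \text{(A.21)}
\end{aligned}$$

where (A.20) is due to Lemma A.11, and (A.21) is obtained by

$$\begin{aligned}
\|\theta\|^2 - n\sigma^2 + \frac{n^2\sigma^4}{\|\theta\|^2 + n\sigma^2 - 4\sigma^2} - \frac{\|\theta\|^4}{\|\theta\|^2 + n\sigma^2} &= \frac{\|\theta\|^4 + 4\sigma^2(n\sigma^2 - \|\theta\|^2)}{\|\theta\|^2 + n\sigma^2 - 4\sigma^2} - \frac{\|\theta\|^4}{\|\theta\|^2 + n\sigma^2} \\
&= \frac{4n^2\sigma^6}{(\|\theta\|^2 + n\sigma^2 - 4\sigma^2)(\|\theta\|^2 + n\sigma^2)} \\
&\le \frac{4n}{n-4}\,\sigma^2.
\end{aligned}$$

□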
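The algebra behind (A.21) is easy to get wrong by hand, so a numerical spot check may be useful. The sketch below compares the left-hand side of the last display with its closed form and with the bound $\frac{4n}{n-4}\sigma^2$ at a few arbitrary parameter values; the function names are hypothetical.

```python
def gap(theta_sq, n, sigma_sq):
    """||theta||^2 - n s^2 + n^2 s^4 / (D - 4 s^2) - ||theta||^4 / D, with D = ||theta||^2 + n s^2."""
    d = theta_sq + n * sigma_sq
    return theta_sq - n * sigma_sq + (n * sigma_sq) ** 2 / (d - 4.0 * sigma_sq) - theta_sq ** 2 / d

def closed_form_gap(theta_sq, n, sigma_sq):
    """4 n^2 s^6 / ((D - 4 s^2) D), the simplified form of the same quantity."""
    d = theta_sq + n * sigma_sq
    return 4.0 * n ** 2 * sigma_sq ** 3 / ((d - 4.0 * sigma_sq) * d)

if __name__ == "__main__":
    for theta_sq, n, sigma_sq in [(2.0, 5, 1.0), (0.0, 10, 0.5), (30.0, 8, 2.0)]:
        bound = 4.0 * n / (n - 4.0) * sigma_sq
        print(gap(theta_sq, n, sigma_sq), closed_form_gap(theta_sq, n, sigma_sq), bound)
```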
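Proof of Lemma A.8. First, the second inequality

$$\mathbb{E}\|\hat\theta_+ - \theta\|^2 \le \frac{n\sigma^2\|\theta\|^2}{\|\theta\|^2 + n\sigma^2} + 4\sigma^2$$

is given by Lemma 3.10 from [22]. We thus focus on the first inequality. For convenience we write

$$g_+(x) = \left( \frac{\|x\|^2 - n\sigma^2}{\|x\|^2} \right)_+, \qquad g_\dagger(x) = \frac{s(x)^2 - n\sigma^2}{s(x)\,\|x\|}$$

with

$$s(x) = \begin{cases} \sqrt{n\sigma^2} & \text{if } \|x\| < \sqrt{n\sigma^2}, \\ \sqrt{n\sigma^2} + c & \text{if } \|x\| > \sqrt{n\sigma^2} + c, \\ \|x\| & \text{otherwise.} \end{cases}$$

Notice that $g_+(x) = g_\dagger(x)$ when $\|x\| \le \sqrt{n\sigma^2} + c$ and $g_+(x) > g_\dagger(x)$ when $\|x\| > \sqrt{n\sigma^2} + c$. Since $g_\dagger$ and $g_+$ both only depend on $\|x\|$, we sometimes will also write $g_\dagger(\|x\|)$ for $g_\dagger(x)$ and $g_+(\|x\|)$ for $g_+(x)$. Setting $\mathbb{E}_r(\cdot)$ to denote the conditional expectation $\mathbb{E}(\cdot \mid \|X\| = r)$ for brevity, it suffices to show that for $r > \sqrt{n\sigma^2} + c$

$$\begin{aligned}
& \mathbb{E}_r\big( \|g_\dagger(X)X - \theta\|^2 \big) \le \mathbb{E}_r\big( \|g_+(X)X - \theta\|^2 \big) \\
\Longleftrightarrow\ & g_\dagger(r)^2 r^2 - 2g_\dagger(r)\,\mathbb{E}_r\langle X, \theta \rangle \le g_+(r)^2 r^2 - 2g_+(r)\,\mathbb{E}_r\langle X, \theta \rangle \\
\Longleftrightarrow\ & \big( g_+(r)^2 - g_\dagger(r)^2 \big) r^2 \ge 2\big( g_+(r) - g_\dagger(r) \big)\,\mathbb{E}_r\langle X, \theta \rangle \\
\Longleftrightarrow\ & \big( g_\dagger(r) + g_+(r) \big) r^2 \ge 2\,\mathbb{E}_r\langle X, \theta \rangle.
\end{aligned} \tag{A.22}$$

On the other hand, since $g_\dagger(r)\,r^2 \ge c\,r \ge c^2 \ge \|\theta\|^2$ and $g_+(r)\,r^2 = r^2 - n\sigma^2$ for such $r$, we have

$$\begin{aligned}
\big( g_\dagger(r) + g_+(r) \big) r^2 &\ge \|\theta\|^2 + r^2 - n\sigma^2 \\
&= \|\theta\|^2 + r^2 - 2\,\mathbb{E}_r\langle X, \theta \rangle - n\sigma^2 + 2\,\mathbb{E}_r\langle X, \theta \rangle \\
&= \mathbb{E}_r\|X - \theta\|^2 - n\sigma^2 + 2\,\mathbb{E}_r\langle X, \theta \rangle \\
&\ge 2\,\mathbb{E}_r\langle X, \theta \rangle,
\end{aligned}$$

where the last inequality is because

$$\|X - \theta\|^2 \ge (\|X\| - \|\theta\|)^2 \ge n\sigma^2.$$

Thus, (A.22) holds and hence $\mathbb{E}\|\hat\theta_\dagger - \theta\|^2 \le \mathbb{E}\|\hat\theta_+ - \theta\|^2$. □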

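Proof of Lemma A.9. It is easy to see that $V_1 \le V_2$, because for any $\theta$ the inside minimum is smaller for $(A_1)$ than for $(A_2)$. Next, we will show $V_1 \ge V_2$.

Suppose that $\theta^*$ achieves the value $V_2$, with corresponding $b^*$ and $\omega^*$. We claim that $\theta^*$ is non-increasing. In fact, if $\theta^*$ is not non-increasing, then there must exist an index $j$ such that $\theta^*_j < \theta^*_{j+1}$, and for simplicity let's assume that $\theta^*_1 < \theta^*_2$. We are going to show that this leads to $b^*_1 = b^*_2$ and $\omega^*_1 = \omega^*_2$. Write

$$s_1 = \frac{\theta_1^{*4}}{\theta_1^{*2} + \varepsilon^2}, \qquad s_2 = \frac{\theta_2^{*4}}{\theta_2^{*2} + \varepsilon^2}.$$

We have $s_1 < s_2$. Let $\bar b^* = \frac{b^*_1 + b^*_2}{2}$ and observe that $b^*_1 \ge \bar b^* \ge b^*_2$. Notice that

$$\begin{aligned}
& \big( s_1 2^{-2b^*_1} + s_2 2^{-2b^*_2} \big) - \big( s_1 2^{-2\bar b^*} + s_2 2^{-2\bar b^*} \big) \\
&\quad = s_1\big( 2^{-2b^*_1} - 2^{-2\bar b^*} \big) + s_2\big( 2^{-2b^*_2} - 2^{-2\bar b^*} \big) \\
&\quad \ge s_2\big( 2^{-2b^*_1} - 2^{-2\bar b^*} \big) + s_2\big( 2^{-2b^*_2} - 2^{-2\bar b^*} \big) \\
&\quad = s_2\big( 2^{-2b^*_1} + 2^{-2b^*_2} - 2\cdot 2^{-2\bar b^*} \big) \\
&\quad \ge 0,
\end{aligned}$$

where equality holds if and only if $b^*_1 = b^*_2$, since $s_2 > s_1 > 0$. Hence, $b^*_1$ and $b^*_2$ have to be equal, or otherwise it would contradict the assumption that $b^*$ achieves the inside minimum of $(A_2)$. Now turn to $\omega^*$. Write $\bar\omega^* = \frac{\omega^*_1 + \omega^*_2}{2}$ and note that $\omega^*_1 \ge \bar\omega^* \ge \omega^*_2$. Consider

$$\begin{aligned}
& \big( (1 - \omega^*_1)^2\theta_1^{*2} + \omega_1^{*2}\varepsilon^2 \big) + \big( (1 - \omega^*_2)^2\theta_2^{*2} + \omega_2^{*2}\varepsilon^2 \big) - \big( (1 - \bar\omega^*)^2(\theta_1^{*2} + \theta_2^{*2}) + 2\bar\omega^{*2}\varepsilon^2 \big) \\
&\quad = \big( (1 - \omega^*_1)^2 - (1 - \bar\omega^*)^2 \big)\theta_1^{*2} + \big( (1 - \omega^*_2)^2 - (1 - \bar\omega^*)^2 \big)\theta_2^{*2} + \big( \omega_1^{*2} + \omega_2^{*2} - 2\bar\omega^{*2} \big)\varepsilon^2 \\
&\quad \ge \big( (1 - \omega^*_1)^2 - (1 - \bar\omega^*)^2 \big)\theta_2^{*2} + \big( (1 - \omega^*_2)^2 - (1 - \bar\omega^*)^2 \big)\theta_2^{*2} + \big( \omega_1^{*2} + \omega_2^{*2} - 2\bar\omega^{*2} \big)\varepsilon^2 \\
&\quad = \big( (1 - \omega^*_1)^2 + (1 - \omega^*_2)^2 - 2(1 - \bar\omega^*)^2 \big)\theta_2^{*2} + \big( \omega_1^{*2} + \omega_2^{*2} - 2\bar\omega^{*2} \big)\varepsilon^2 \\
&\quad \ge 0,
\end{aligned}$$

where the equality holds if and only if $\omega^*_1 = \omega^*_2$. Therefore, $\omega^*_1$ and $\omega^*_2$ must be equal. Now, with $b^*_1 = b^*_2$ and $\omega^*_1 = \omega^*_2$, we can switch $\theta^*_1$ and $\theta^*_2$ without increasing the objective function or violating the constraints. Thus, our claim that $\theta^*$ is non-increasing is justified.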

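Now that we have shown that the solution triplet $(\theta^*, b^*, \omega^*)$ to $(A_2)$ satisfies that $\theta^*$ is non-increasing, in order to prove $V_1 \ge V_2$, it suffices to show that if we take $\theta = \theta^*$ in $(A_1)$, the minimizer $\tilde b^*$ is non-increasing and $\tilde b^*_1 \le b_{\max}$. In fact, if so, we will have $\tilde b^* = b^*$ as well as $\omega^*_j = \frac{\theta_j^{*2}}{\theta_j^{*2} + \varepsilon^2}$, and then

$$V_1 \ge \min_{b:\,\sum_j b_j \le B}\ \sum_{j=1}^{N} \left( \frac{\theta_j^{*4}}{\theta_j^{*2} + \varepsilon^2}\,2^{-2b_j} + \frac{\theta_j^{*2}\varepsilon^2}{\theta_j^{*2} + \varepsilon^2} \right) \ge V_2.$$

Let's take $\theta = \theta^*$ in $(A_1)$. The optimal $\tilde b^*$ is non-increasing because the solution is given by the "reverse water-filling" scheme and $\theta^*$ is non-increasing. Next, we will show that $\tilde b^*_1 \le b_{\max}$. If $\tilde b^*_1 > b_{\max}$, then we would have for $j = 1, \ldots, N$

$$\frac{\theta_j^{*4}}{\theta_j^{*2} + \varepsilon^2}\,2^{-2\tilde b^*_j} \le \frac{\theta_1^{*4}}{\theta_1^{*2} + \varepsilon^2}\,2^{-2\tilde b^*_1} \le \theta_1^{*2}\,2^{-2b_{\max}} \le c^2\,2^{-4\log(1/\varepsilon)} = c^2\varepsilon^4,$$

where the first inequality follows from the "reverse water-filling" solution, and therefore

$$\sum_{j=1}^{N} \frac{\theta_j^{*4}}{\theta_j^{*2} + \varepsilon^2}\,2^{-2\tilde b^*_j} \le Nc^2\varepsilon^4 = o\big( \varepsilon^{\frac{4m}{2m+1}} \big),$$

which would not give the optimal solution. Hence, $\tilde b^*_1 \le b_{\max}$, and this completes the proof. □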
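The "reverse water-filling" allocation invoked in this proof can be sketched as follows. This is a generic illustration of the classical scheme for minimizing $\sum_j s_j 2^{-2b_j}$ subject to $\sum_j b_j \le B$ and $b_j \ge 0$, with $s_j$ playing the role of $\theta_j^{*4}/(\theta_j^{*2} + \varepsilon^2)$; the function name, the bisection on the water level, and the iteration count are choices made here, not details taken from the paper.

```python
import math

def reverse_water_filling(s, B, iters=200):
    """Return (bits, eta): b_j = max(0, 0.5 * log2(s_j / eta)) with total bits close to B."""
    hi = max(s) if s else 0.0
    if B <= 0 or hi <= 0:
        return [0.0] * len(s), hi

    def total_bits(eta):
        return sum(0.5 * math.log2(v / eta) for v in s if v > eta)

    lo = hi
    while total_bits(lo) < B:            # lower the water level until the budget is exceeded
        lo /= 2.0
    for _ in range(iters):               # bisection on a geometric scale
        mid = math.sqrt(lo * hi)
        if total_bits(mid) > B:
            lo = mid
        else:
            hi = mid
    eta = hi                             # total_bits(eta) <= B by construction
    bits = [0.5 * math.log2(v / eta) if v > eta else 0.0 for v in s]
    return bits, eta

if __name__ == "__main__":
    bits, eta = reverse_water_filling([4.0, 2.0, 1.0, 0.5], B=2.0)
    print(bits, eta)
```

Components with $s_j$ below the water level $\eta$ receive no bits, and every coded component ends up with the same distortion $s_j 2^{-2b_j} = \eta$. In particular, a non-increasing sequence $s_j$ yields a non-increasing bit allocation, which is the property used in the proof.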


Lemma A.11. Suppose that $W_{n,\xi}$ follows a non-central chi-square distribution with $n$ degrees of freedom and non-centrality parameter $\xi$. We have for $n \ge 5$

$$\mathbb{E}\big( W_{n,\xi}^{-1} \big) \le \frac{1}{n + \xi - 4},$$

and for $n \ge 7$

$$\mathbb{E}\big( W_{n,\xi}^{-2} \big) \le \frac{n - 4}{(n - 6)(n + \xi - 4)(n + \xi - 6)}.$$

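Proof. It is well known that the non-central chi-square random variable $W_{n,\xi}$ can be written as a Poisson-weighted mixture of central chi-square distributions, i.e., $W_{n,\xi} \sim \chi^2_{n+2K}$ with $K \sim \mathrm{Poisson}(\xi/2)$. Then

$$\mathbb{E}\big( W_{n,\xi}^{-1} \big) = \mathbb{E}\big( \mathbb{E}( W_{n,\xi}^{-1} \mid K ) \big) = \mathbb{E}\left( \frac{1}{n + 2K - 2} \right) \ge \frac{1}{n + 2\mathbb{E}K - 2} = \frac{1}{n + \xi - 2},$$

where we have used the fact that $\mathbb{E}(1/\chi^2_n) = \frac{1}{n-2}$ and Jensen's inequality. Similarly, we have

$$\mathbb{E}\big( W_{n,\xi}^{-2} \big) = \mathbb{E}\big( \mathbb{E}( W_{n,\xi}^{-2} \mid K ) \big) = \mathbb{E}\left( \frac{1}{(n + 2K - 2)(n + 2K - 4)} \right) \ge \frac{1}{(n + 2\mathbb{E}K - 2)(n + 2\mathbb{E}K - 4)} = \frac{1}{(n + \xi - 2)(n + \xi - 4)}.$$

Using the Poisson-weighted mixture representation, the following recurrence relation can be derived [7]

$$1 = \xi\,\mathbb{E}\big( W_{n+4,\xi}^{-1} \big) + n\,\mathbb{E}\big( W_{n+2,\xi}^{-1} \big), \tag{A.23}$$

$$\mathbb{E}\big( W_{n,\xi}^{-1} \big) = \xi\,\mathbb{E}\big( W_{n+4,\xi}^{-2} \big) + n\,\mathbb{E}\big( W_{n+2,\xi}^{-2} \big), \tag{A.24}$$

for $n \ge 3$. Thus,

$$\mathbb{E}\big( W_{n+4,\xi}^{-1} \big) = \frac{1}{\xi} - \frac{n}{\xi}\,\mathbb{E}\big( W_{n+2,\xi}^{-1} \big) \le \frac{1}{\xi} - \frac{n}{\xi}\cdot\frac{1}{n + \xi} = \frac{1}{n + \xi}.$$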
rc + f' 
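Replacing $n$ by $n - 4$ proves the first inequality of the lemma. On the other hand, rearranging (A.23), we get

$$\mathbb{E}\big( W_{n+2,\xi}^{-1} \big) = \frac{1}{n} - \frac{\xi}{n}\,\mathbb{E}\big( W_{n+4,\xi}^{-1} \big) \le \frac{1}{n} - \frac{\xi}{n}\cdot\frac{1}{n + \xi + 2} = \frac{n + 2}{n(n + \xi + 2)}.$$

Now using (A.24), we have

$$\begin{aligned}
\mathbb{E}\big( W_{n+4,\xi}^{-2} \big) &= \frac{1}{\xi}\,\mathbb{E}\big( W_{n,\xi}^{-1} \big) - \frac{n}{\xi}\,\mathbb{E}\big( W_{n+2,\xi}^{-2} \big) \\
&\le \frac{1}{\xi}\cdot\frac{n}{(n - 2)(n + \xi)} - \frac{n}{\xi}\cdot\frac{1}{(n + \xi)(n + \xi - 2)} \\
&= \frac{n}{(n - 2)(n + \xi)(n + \xi - 2)}.
\end{aligned}$$

Replacing $n$ by $n - 4$ proves the second inequality of the lemma. □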
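The bounds of Lemma A.11 can be sanity-checked numerically through the same Poisson-mixture representation used in the proof, since the inverse moments of a central chi-square are available in closed form. The sketch below truncates the Poisson sum at a fixed number of terms (an approximation chosen here, adequate for moderate $\xi$); the function name `inverse_moments` is hypothetical.

```python
import math

def inverse_moments(n, xi, terms=500):
    """Approximate E(1/W) and E(1/W^2) for W ~ noncentral chi-square(n, xi), n >= 5.

    Uses W | K ~ chi^2_{n+2K} with K ~ Poisson(xi/2), together with
    E(1/chi^2_m) = 1/(m-2) and E(1/chi^4_m) = 1/((m-2)(m-4)).
    """
    lam = xi / 2.0
    weight = math.exp(-lam)              # Poisson pmf at k = 0
    m1 = m2 = 0.0
    for k in range(terms):
        df = n + 2 * k
        m1 += weight / (df - 2)
        m2 += weight / ((df - 2) * (df - 4))
        weight *= lam / (k + 1)          # pmf at k + 1 from pmf at k
    return m1, m2

if __name__ == "__main__":
    n = 7
    for xi in (0.0, 2.0, 10.0, 50.0):
        m1, m2 = inverse_moments(n, xi)
        upper1 = 1.0 / (n + xi - 4)
        upper2 = (n - 4) / ((n - 6) * (n + xi - 4) * (n + xi - 6))
        print(xi, m1, upper1, m2, upper2)
```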